Skip to content

Instantly share code, notes, and snippets.

@quephird
Last active May 5, 2022 20:19
Show Gist options
  • Star 26 You must be signed in to star a gist
  • Fork 4 You must be signed in to fork a gist
  • Save quephird/ddd745dd10037ca09ccec5aa60b281f2 to your computer and use it in GitHub Desktop.
Save quephird/ddd745dd10037ca09ccec5aa60b281f2 to your computer and use it in GitHub Desktop.

SQL Tricks

Contents

Purpose

This purpose of this is to document all the tricks that I've learned over the last couple of decades (good grief, has it really been that long?) from interacting with and developing for databases.

I have organized this as an FAQ of sorts rather than by technical concept, e.g., scalar subuery, chasm trap, etc., because when you don't know how to express something in code, you can't know what the solution type is called or named. My other endeavor is to capture the more thornier types of queries that can come up as requirements, instead of just itemizing trivial, common examples.

Nota bene

This is still a work in progress! And suggestions or comments are welcome!

Example schema

All of the SQL statements in this can be used against the canonical supplier/parts tables used in many of C.J. Date's books, https://wiki.c2.com/?SupplierPartsDatabase, as well as the prototypical departments/employees tables installed by default in Oracle databases, https://www.orafaq.com/wiki/SCOTT

The tables used are:

Suppliers

  id  name   status   city
  ----------------------------
  S1  Smith  20       London
  S2  Jones  10       Paris
  S3  Blake  30       Paris
  S4  Clark  20       London
  S5  Adams  30       Athens

Parts

  id  name   color  weight  city
  ----------------------------------
  P1  Nut    Red    12.0    London
  P2  Bolt   Green  17.0    Paris
  P3  Screw  Blue   17.0    Oslo
  P4  Screw  Red    14.0    London
  P5  Cam    Blue   12.0    Paris
  P6  Cog    Red    19.0    London

Supplier parts

  supplier_id  part_id  quantity
  ------------------------------
  S1           P1       300
  S1           P2       200
  S1           P3       400
  S1           P4       200
  S1           P5       100
  S1           P6       100
  S2           P1       300
  S2           P2       400
  S3           P2       200
  S4           P2       200
  S4           P4       300
  S4           P5       400

... and for the other tables:

Departments

 id |    name    | location
----+------------+----------
 10 | ACCOUNTING | NEW YORK
 20 | RESEARCH   | DALLAS
 30 | SALES      | CHICAGO
 40 | OPERATIONS | BOSTON
(4 rows)

Employees

  id  |  name  |    job    | manager_id | hire_date  | salary | commission | department_id
------+--------+-----------+------------+------------+--------+------------+---------------
 7839 | KING   | PRESIDENT |            | 1981-11-17 |   5000 |            |            10
 7698 | BLAKE  | MANAGER   |       7839 | 1981-05-01 |   2850 |            |            30
 7782 | CLARK  | MANAGER   |       7839 | 1981-06-09 |   2450 |            |            10
 7566 | JONES  | MANAGER   |       7839 | 1981-04-02 |   2975 |            |            20
 7788 | SCOTT  | ANALYST   |       7566 | 1987-06-13 |   3000 |            |            20
 7902 | FORD   | ANALYST   |       7566 | 1981-12-03 |   3000 |            |            20
 7369 | SMITH  | CLERK     |       7902 | 1980-12-17 |    800 |            |            20
 7499 | ALLEN  | SALESMAN  |       7698 | 1981-02-20 |   1600 |        300 |            30
 7521 | WARD   | SALESMAN  |       7698 | 1981-02-22 |   1250 |        500 |            30
 7654 | MARTIN | SALESMAN  |       7698 | 1981-09-28 |   1250 |       1400 |            30
 7844 | TURNER | SALESMAN  |       7698 | 1981-09-08 |   1500 |          0 |            30
 7876 | ADAMS  | CLERK     |       7788 | 1987-07-13 |   1100 |            |            20
 7900 | JAMES  | CLERK     |       7698 | 1981-12-03 |    950 |            |            30
 7934 | MILLER | CLERK     |       7782 | 1982-01-23 |   1300 |            |            10
(14 rows)

A SQL script is attached this gist to allow you to create the schema in your own database.

Selecting records

How do I get a list of items and the count of each in a table?

Let's say that we want to know which parts are available and the number of suppliers for each. In this case, we want to group by the part_id, and simply count the number of supplier_ids:

select  part_id,
        count(supplier_id)
from    suppliers_parts
group by part_id;

 part_id | count
---------+-------
 P2      |     4
 P1      |     2
 P6      |     1
 P4      |     2
 P5      |     2
 P3      |     1
(6 rows)

How do I get a list of items and the sum of another column in a table?

In this case, we still want to group by the part ID but we want to sum the quantity for each:

select  part_id,
        sum(quantity)
from    suppliers_parts
group by part_id;

 part_id | sum
---------+------
 P2      | 1000
 P1      |  600
 P6      |  100
 P4      |  500
 P5      |  500
 P3      |  400
(6 rows)

How do I get a list of items and filter the result set on some aggregate?

What if you wanted to only include those parts for which at least a certain about, say 500, were available? Then we can filter records out using the having clause. You can think of it like where : select :: having : group by:

select  part_id,
        sum(quantity)
from    suppliers_parts
group by part_id
having   sum(quantity) > 500;

 part_id | sum
---------+------
 P2      | 1000
 P1      |  600
(2 rows)

How do I concatenate a list of string values each relavant to a row?

Let's say you wanted get a list of all the supplier names for each part. You can take advantage of string_agg here just like you would any other SQL aggregate function like max or sum:

select  sp.part_id,
        string_agg(s.name, ', ') as supplier_list
from    suppliers_parts sp
join    suppliers s
on      sp.supplier_id = s.id
group by sp.part_id;

 part_id |       supplier_list
---------+----------------------------
 P2      | Smith, Jones, Blake, Clark
 P1      | Smith, Jones
 P6      | Smith
 P4      | Smith, Clark
 P5      | Smith, Clark
 P3      | Smith
(6 rows)

In MySQL, you just need to use group_concat:

select  sp.part_id,
        group_concat(s.name) as supplier_list
from    suppliers_parts sp
join    suppliers s
on      sp.supplier_id = s.id
group by sp.part_id;

+---------+-------------------------+
| part_id | supplier_list           |
+---------+-------------------------+
| P1      | Smith,Jones             |
| P2      | Smith,Jones,Blake,Clark |
| P3      | Smith                   |
| P4      | Smith,Clark             |
| P5      | Smith,Clark             |
| P6      | Smith                   |
+---------+-------------------------+
6 rows in set (0.00 sec)

How do I join two tables and make sure rows don't drop out?

It's very common to need to join two or more tables but wind up dropping rows because certain rows from the first table don't have corresponding entries in the other. For instance, let's see we want to get a total count of quantities for all known parts, even if there are parts for which none is available. If we do this with a simple join one of the suppliers drops out:

select  s.id,
        s.name,
        sum(sp.quantity) as total_quantity
from    suppliers s
join    suppliers_parts sp
on      s.id = sp.supplier_id
group by s.id,
         s.name
order by s.id

 id | name  | total_quantity
----+-------+----------------
 S1 | Smith |           1300
 S2 | Jones |            700
 S3 | Blake |            200
 S4 | Clark |            900
(4 rows)

This is exactly what left outer join is useful for:

select  s.id,
        s.name,
        sum(sp.quantity) as total_quantity
from    suppliers s
left outer join suppliers_parts sp
on      s.id = sp.supplier_id
group by s.id,
         s.name
order by s.id

 id | name  | total_quantity
----+-------+----------------
 S1 | Smith |           1300
 S2 | Jones |            700
 S3 | Blake |            200
 S4 | Clark |            900
 S5 | Adams |
(5 rows)

How do I join two tables and ensure I get all rows back if some rows from one are missing from the other and vice versa?

This is a rather contrived example, but let's say we want to see more clearly which suppliers and parts are from the same city, and which suppliers lack an accompanying part and which parts don't have a supplier in the same city, all at once. For something like this, we can do a full outer join on the city column:

select  p.name,
        p.city,
        s.name,
        s.city
from    parts p
full outer join suppliers s
on      p.city = s.city

 name  |  city  | name  |  city
-------+--------+-------+--------
 Cog   | London | Smith | London
 Screw | London | Smith | London
 Nut   | London | Smith | London
 Cam   | Paris  | Jones | Paris
 Bolt  | Paris  | Jones | Paris
 Cam   | Paris  | Blake | Paris
 Bolt  | Paris  | Blake | Paris
 Cog   | London | Clark | London
 Screw | London | Clark | London
 Nut   | London | Clark | London
       |        | Adams | Athens
 Screw | Oslo   |       |
(12 rows)

You can clearly see that Adams has no part in Athens and the Screw has no supplier in Oslo.

How do I compare two tables/record sets and see which rows exist in one but not the either?

One way to do this is indeed to use an exists clause but if the two sets of records have the same structure, an easier way to express this is to use except in PostgreSQL (or minus in Oracle). Let's say we want to find out which cities making parts are not home to any suppliers?

select  p.city
from    parts p
except
select  s.city
from    suppliers s;

 city
------
 Oslo
(1 row)

You can easily answer the reverse question too:

select  s.city
from    suppliers s
except
select  p.city
from    parts p;

  city
--------
 Athens
(1 row)

Note that the shapes and types of the rows from each select must be the same. You can also achieve this sort of thing using not exists, but this pattern here is more commonly used with the two tables/record sets are similar in structure.

How do I join a table to the results of another query?

Oftentimes, you would like to join the results of one query to another table but can't do so easily. It is tempting to create a temporary table to do this sort fo thing, but you don't need to with the SQL in most databases. Instead you can use what is called an inline view to produce a "temporary" result set and then join to that.

In most cases, you don't really need an inline view but sometimes it's easier to express things in an inline view first rather writing one top level query to properly aggregrate everything all at once. Again, needing to do this is sometimes symptomatic of a non-optimal schema.

Let's say we want to first get the set of cities and number of parts made for each, and then merge that to the set of suppliers; we can do this:

select  s.id,
        s.name,
        s.city,
        c.part_count
from    suppliers s
join   (select  p.city,
                count(p.id) as part_count
        from    parts p
        group by p.city) as c
on      s.city = c.city

 id | name  |  city  | part_count
----+-------+--------+------------
 S1 | Smith | London |          3
 S2 | Jones | Paris  |          2
 S3 | Blake | Paris  |          2
 S4 | Clark | London |          3
(4 rows)

Note that one of the suppliers drops out above. One way around that is to use a so-called scalar subquery instead:

select  s.id,
        s.name,
        s.city,
       (select  count(p.id)
        from    parts p
        where   p.city = s.city) as part_count
from    suppliers s

 id | name  |  city  | count
----+-------+--------+-------
 S1 | Smith | London |     3
 S2 | Jones | Paris  |     2
 S3 | Blake | Paris  |     2
 S4 | Clark | London |     3
 S5 | Adams | Athens |     0
(5 rows)

There are times which this is the easiest and most expressive way to obtain what you want but beware that this query is less performant as the scalar subquery is "executed" once per row in the outer query.

How do I compute multiple aggregations at the same time without counting rows more than once?

Let's say that ultimately we want a list of the suppliers, the count of parts available in the same city as each supplier, and the total quantities of all parts available for each supplier. We can first get suppliers and part counts with this query:

select  s.id,
        s.name,
        s.city,
        c.part_count
from    suppliers s
join   (select  p.city,
                count(p.id) as part_count
        from    parts p
        group by p.city) as c
on      s.city = c.city;

 id | name  |  city  | part_count
----+-------+--------+------------
 S1 | Smith | London |          3
 S2 | Jones | Paris  |          2
 S3 | Blake | Paris  |          2
 S4 | Clark | London |          3
(4 rows)

If we then add logic to get part quantities for each supplier we can arrive at this:

select  s.id,
        s.name,
        count(p.id) as part_count,
        sum(sp.quantity) as total_quantity
from    suppliers s
join    parts p
on      s.city = p.city
join    suppliers_parts sp
on      s.id = sp.supplier_id
group by s.id,
         s.name;

 id | name  | part_count | total_quantity
----+-------+------------+----------------
 S3 | Blake |          2 |            400
 S1 | Smith |         18 |           3900
 S2 | Jones |          4 |           1400
 S4 | Clark |          9 |           2700
(4 rows)

Uh oh... what happened? Why did the part counts suddenly go up? It's due to something called chasm trap, which is what happens when you join one table to two other tables and try aggregating on them both at the same time. By joining all three tables at onces like this, we effectively create a Cartesian product of records, thereby counting/summing column values more than once.

The way out of this problem is to inline each aggregation separately and then join them:

select  pc.id,
        pc.name,
        pc.part_count,
        tq.total_quantity
from   (select  s.id,
                s.name,
                count(p.id) as part_count
        from    suppliers s
        join    parts p
        on      s.city = p.city
        group by s.id,
                 s.name) pc
join   (select  sp.supplier_id,
                sum(sp.quantity) as total_quantity
        from    suppliers_parts sp
        group by sp.supplier_id) as tq
on      pc.id = tq.supplier_id;

 id | name  | part_count | total_quantity
----+-------+------------+----------------
 S2 | Jones |          2 |            700
 S1 | Smith |          3 |           1300
 S3 | Blake |          2 |            200
 S4 | Clark |          3 |            900
(4 rows)

At this point you may want to take advantage of a technique called subquery refactoring which allows you to move parts of the main query out into a with clause

with part_counts as
       (select  s.id,
                s.name,
                count(p.id) as part_count
        from    suppliers s
        join    parts p
        on      s.city = p.city
        group by s.id,
                 s.name),
total_quantities as
       (select  sp.supplier_id,
                sum(sp.quantity) as total_quantity
        from    suppliers_parts sp
        group by sp.supplier_id)
select  pc.id,
        pc.name,
        pc.part_count,
        tq.total_quantity
from    part_counts pc
join    total_quantities tq
on      pc.id = tq.supplier_id;

This makes the main portion of the query more succuinct and elucidate what you actually want to do at the top level.

How do get a list of rows for which a certain column value does not exist in a table?

There's two ways to do this, one of which is more expressive of the other in my humble opinion. Let's say you wanted to figure out which of the suppliers has no parts? One way is to scan through the list of records in suppliers and leverage a not exists clause:

select  s.id,
        s.name
from    suppliers s
where not exists
       (select  'x'
        from    suppliers_parts sp
        where   sp.supplier_id = s.id);

 id | name
----+-------
 S5 | Adams
(1 row)

Note that the inner subquery just needs to return some value; it doesn't matter what it is. Another way is to do an outer join from suppliers to suppliers_parts and see which rows have a null value for the part_id:

select  s.id,
        s.name
from    suppliers s
left outer join suppliers_parts sp
on      s.id = sp.supplier_id
where   sp.supplier_id is null;

 id | name
----+-------
 S5 | Adams
(1 row)

I feel like the first query has clearer intent.

How do get a list of rows for which a certain ID does exist in another table?

There's again two ways to do this. Let's say you wanted to know which of the suppliers does offer parts? Likewise, one way is to scan through the list of records in suppliers and leverage an exists clause this time:

select  s.id,
        s.name
from    suppliers s
where exists
       (select  'x'
        from    suppliers_parts sp
        where   sp.supplier_id = s.id);

 id | name
----+-------
 S1 | Smith
 S2 | Jones
 S3 | Blake
 S4 | Clark
(4 rows)

... and indeed we get the other four suppliers. You can also do this with an outer join from suppliers to suppliers_parts but this time you have to make sure you deduplicate the results:

select  distinct s.id,
        s.name
from    suppliers s
join    suppliers_parts sp
on      s.id = sp.supplier_id;

 id | name
----+-------
 S3 | Blake
 S4 | Clark
 S2 | Jones
 S1 | Smith
(4 rows)

Again, I feel like the first query has significantly clearer intent.

How do I select records based on ranking of a column value?

Oftentimes, legacy data are stored in tables that maintain their own history, either with a date column or sequence number column, and you need to be able to select only the records with the highest date or sequence number per some other colunn value.

We don't have a date column in this schema, so we'll manufacture an example here by asking: which of the suppliers_parts records have the "maximum" value of the part ID per supplier ID. You could partially accomplish this just using max and group by except that selecting any other columns in the table will cause the query to bring back the maximum part ID per tuple of supplier ID and quantity:

select  s.id,
        max(sp1.part_id),
        sp1.quantity
from    suppliers s
join suppliers_parts sp1
on      sp1.supplier_id = s.id
group by supplier_id,
         quantity;

 supplier_id | max | quantity
-------------+-----+----------
 S2          | P1  |      300
 S3          | P2  |      200
 S1          | P6  |      100
 S2          | P2  |      400
 S4          | P5  |      400
 S4          | P2  |      200
 S1          | P1  |      300
 S4          | P4  |      300
 S1          | P4  |      200
 S1          | P3  |      400
(10 rows)

What we really want to do is be able to pick only the records with the maximum part ID. We can do that with a subquery:

select  s.id,
        sp1.part_id,
        sp1.quantity
from    suppliers s
join suppliers_parts sp1
on      sp1.supplier_id = s.id
where   sp1.part_id =
       (select  max(sp2.part_id)
        from    suppliers_parts sp2
        where   sp2.supplier_id = sp1.supplier_id)

 supplier_id | part_id | quantity
-------------+---------+----------
 S1          | P6      |      100
 S2          | P2      |      400
 S3          | P2      |      200
 S4          | P5      |      400
(4 rows)

But what if you wanted to also pick up records for which the supplier had no shipments? You can then utilize a left outer join and checking for null value on the joined column to ensure that you pick them up.

select  s.id,
        sp1.part_id,
        sp1.quantity
from    suppliers s
left outer join suppliers_parts sp1
on      s.id = sp1.supplier_id
where   sp1.supplier_id is null
or      sp1.part_id =
       (select  max(sp2.part_id)
        from    suppliers_parts sp2
        where   sp2.supplier_id = sp1.supplier_id);

 id | part_id | quantity
----+---------+----------
 S1 | P6      |      100
 S2 | P2      |      400
 S3 | P2      |      200
 S4 | P5      |      400
 S5 |         |
(5 rows)

How do I select records based on ranking of values that do not exist in the database?

Often you have to deal with legacy schemas which lack an explicit sorting or ranking column but there are column values on which you can infer/compute them. For example, let's say that we want to choose a supplier for each part based on a rank that's not directly available in any table. But we do have the names for each supplier and we know which ones are preferred over others by name. Imagine we have the following preference logic for suppliers: if Jones carries it, then choose that; if not and Clark has it then choose that, then Smith, then Adams, and then lastly Blake. We case exploit case to set up a ranking score and then min to select the best (in this case the minimum rank):

select  sp.part_id,
        sp.supplier_id,
        s.name
from    suppliers_parts sp
join    suppliers s
on      sp.supplier_id = s.id
where   case
          when s.name = 'Jones' then 1
          when s.name = 'Clark' then 2
          when s.name = 'Smith' then 3
          when s.name = 'Adams' then 4
          when s.name = 'Blake' then 5
        end =
       (select  min(case
                      when s2.name = 'Jones' then 1
                      when s2.name = 'Clark' then 2
                      when s2.name = 'Smith' then 3
                      when s2.name = 'Adams' then 4
                      when s2.name = 'Blake' then 5
                    end)
        from    suppliers_parts sp2
        join    suppliers s2
        on      sp2.supplier_id = s2.id
        where   sp2.part_id = sp.part_id)
order by sp.part_id;

---------+-------------+-------
 P1      | S2          | Jones
 P2      | S2          | Jones
 P3      | S1          | Smith
 P4      | S4          | Clark
 P5      | S4          | Clark
 P6      | S1          | Smith
(6 rows)

If you look at all the rows in suppliers_parts, you can verify that the suppliers listed above are the preferable ones according to the ranking of supplier names.

How do I pivot row values to column values?

Let's say you wanted to tabulate which suppliers carried each part, with a column dedicated to each possible supplier. In this case, you want to effectively flip part of the suppliers_parts table on its side, and selecting each supplier:

select  sp.part_id,
        max(case when sp.supplier_id = 'S1' then 'X' else null end) as s1,
        max(case when sp.supplier_id = 'S2' then 'X' else null end) as s2,
        max(case when sp.supplier_id = 'S3' then 'X' else null end) as s3,
        max(case when sp.supplier_id = 'S4' then 'X' else null end) as s4,
        max(case when sp.supplier_id = 'S5' then 'X' else null end) as s5
from    suppliers_parts sp
group by sp.part_id
order by sp.part_id;

 part_id | s1 | s2 | s3 | s4 | s5
---------+----+----+----+----+----
 P1      | X  | X  |    |    |
 P2      | X  | X  | X  | X  |
 P3      | X  |    |    |    |
 P4      | X  |    |    | X  |
 P5      | X  |    |    | X  |
 P6      | X  |    |    |    |
(6 rows)

How do I generate a sort index per aggregate?

Let's say you wanted to get a list of all parts and suppliers, and wanted to sort and index each entry in suppliers_parts by the supplier name. You can take advantage of windows functions again here

select  sp.part_id,
        p.name,
        sp.supplier_id,
        s.name,
        row_number() over (partition by sp.part_id order by s.name) as sort_num
from    suppliers_parts sp
join    suppliers s
on      sp.supplier_id = s.id
join    parts p
on      sp.part_id = p.id;

 part_id | name  | supplier_id | name  | sort_num
---------+-------+-------------+-------+----------
 P1      | Nut   | S2          | Jones |        1
 P1      | Nut   | S1          | Smith |        2
 P2      | Bolt  | S3          | Blake |        1
 P2      | Bolt  | S4          | Clark |        2
 P2      | Bolt  | S2          | Jones |        3
 P2      | Bolt  | S1          | Smith |        4
 P3      | Screw | S1          | Smith |        1
 P4      | Screw | S4          | Clark |        1
 P4      | Screw | S1          | Smith |        2
 P5      | Cam   | S4          | Clark |        1
 P5      | Cam   | S1          | Smith |        2
 P6      | Cog   | S1          | Smith |        1
(12 rows)

You can see above that the counts start over for each part ID.

How do I get a more accurate ranking if there are "ties"?

There are times when row_number is not sufficient to derive a proper rank; it's ok for sorting records but not good if you want to see when there are column values that repeat within a grouping. Consider the following:

select  sp.supplier_id,
        sp.part_id,
        sp.quantity,
        row_number() over (partition by sp.supplier_id order by sp.quantity desc) as sort_number
from    suppliers_parts sp

 supplier_id | part_id | quantity | sort_number
-------------+---------+----------+-------------
 S1          | P3      |      400 |           1
 S1          | P1      |      300 |           2
 S1          | P4      |      200 |           3
 S1          | P2      |      200 |           4
 S1          | P5      |      100 |           5
 S1          | P6      |      100 |           6
 S2          | P2      |      400 |           1
 S2          | P1      |      300 |           2
 S3          | P2      |      200 |           1
 S4          | P5      |      400 |           1
 S4          | P4      |      300 |           2
 S4          | P2      |      200 |           3
(12 rows)

For some of the suppliers above, there parts for which the quantities are the same and so you might want them ranked the same. If we want such duplicates taken into consideration, then we can use rank instead:

select  sp.supplier_id,
        sp.part_id,
        sp.quantity,
        rank() over (partition by sp.supplier_id order by sp.quantity desc) as quantity_rank
from    suppliers_parts sp;

 supplier_id | part_id | quantity | quantity_rank
-------------+---------+----------+---------------
 S1          | P3      |      400 |             1
 S1          | P1      |      300 |             2
 S1          | P4      |      200 |             3
 S1          | P2      |      200 |             3
 S1          | P5      |      100 |             5
 S1          | P6      |      100 |             5
 S2          | P2      |      400 |             1
 S2          | P1      |      300 |             2
 S3          | P2      |      200 |             1
 S4          | P5      |      400 |             1
 S4          | P4      |      300 |             2
 S4          | P2      |      200 |             3
(12 rows)

We can further refine this query if we don't want to skip over ranks. That is, if for S1 above, we may want parts P5, and P6 ranked 4 instead of 5. For that, we can use dense_rank:

select  sp.supplier_id,
        sp.part_id,
        sp.quantity,
        dense_rank() over (partition by sp.supplier_id order by sp.quantity desc) as quantity_rank
from    suppliers_parts sp;

 supplier_id | part_id | quantity | quantity_rank
-------------+---------+----------+---------------
 S1          | P3      |      400 |             1
 S1          | P1      |      300 |             2
 S1          | P4      |      200 |             3
 S1          | P2      |      200 |             3
 S1          | P5      |      100 |             4
 S1          | P6      |      100 |             4
 S2          | P2      |      400 |             1
 S2          | P1      |      300 |             2
 S3          | P2      |      200 |             1
 S4          | P5      |      400 |             1
 S4          | P4      |      300 |             2
 S4          | P2      |      200 |             3
(12 rows)

How do I generate rows for which there are no values for a specific column?

There are times when you may want to display records corresponding to columns that have no rows in the database. For example, you may have to list sales or production data for a series of days or months, and for the instances that there are no such figures, you'd like to display a zero value nonetheless. It's situations like this that you'd like to somehow produce rows containing a range of values.

Different databases provide different means of doing this; in PostgreSQL, you can use the table function, generate_series, to accomplish this. Since there are no data columns in our suppliers/parts schema, let's manufacture an example whereby we want to display the number of parts per weight (in this case from 10 to 20), but with the additional requirement that we want a zero count for weights that don't exist in the parts table:

with all_weights(weight) as
       (select  *
        from    generate_series(10, 20))
select  aw.weight,
        count(p.id) as part_count
from    all_weights aw
left outer join parts p
on      p.weight = aw.weight
group by aw.weight
order by aw.weight;

 weight | part_count
--------+------------
     10 |          0
     11 |          0
     12 |          2
     13 |          0
     14 |          1
     15 |          0
     16 |          0
     17 |          2
     18 |          0
     19 |          1
     20 |          0
(11 rows)

How do I not only sort records but see their previous or next values in the same row?

There are times when it is really useful to be able to select current and previous values in the same row. Let's say we want to look at all suppliers and for each one show their respective quantities and parts, as well as ythe

select  supplier_id,
        part_id,
        quantity,
        lag(part_id) over (partition by supplier_id order by quantity) as previous_part_id,
        lag(quantity) over (partition by supplier_id order by quantity) as previous_quantity
from suppliers_parts
order by supplier_id,
         quantity;

 supplier_id | part_id | quantity | previous_part_id | previous_quantity
-------------+---------+----------+------------------+-------------------
 S1          | P5      |      100 |                  |
 S1          | P6      |      100 | P5               |               100
 S1          | P2      |      200 | P6               |               100
 S1          | P4      |      200 | P2               |               200
 S1          | P1      |      300 | P4               |               200
 S1          | P3      |      400 | P1               |               300
 S2          | P1      |      300 |                  |
 S2          | P2      |      400 | P1               |               300
 S3          | P2      |      200 |                  |
 S4          | P2      |      200 |                  |
 S4          | P4      |      300 | P2               |               200
 S4          | P5      |      400 | P4               |               300

You can also get the next values within each row :

select  supplier_id,
        part_id,
        quantity,
        lead(part_id) over (partition by supplier_id order by quantity) as next_part_id,
        lead(quantity) over (partition by supplier_id order by quantity) as next_quantity
from suppliers_parts
order by supplier_id,
         quantity;

 supplier_id | part_id | quantity | next_part_id | next_quantity
-------------+---------+----------+--------------+---------------
 S1          | P5      |      100 | P6           |           100
 S1          | P6      |      100 | P2           |           200
 S1          | P2      |      200 | P4           |           200
 S1          | P4      |      200 | P1           |           300
 S1          | P1      |      300 | P3           |           400
 S1          | P3      |      400 |              |
 S2          | P1      |      300 | P2           |           400
 S2          | P2      |      400 |              |
 S3          | P2      |      200 |              |
 S4          | P2      |      200 | P4           |           300
 S4          | P4      |      300 | P5           |           400
 S4          | P5      |      400 |              |
(12 rows)

Note that for the instances where there is no previous or next value, SQL will generate a null.

How do I getting a running title for a column?

Window functions come to our rescue again here; you can specify how specify which rows to consider for the window as well as how to order them. In the example below, we derive two running totals: one for each supplier sorted by part, and the other overall.

select  supplier_id,
        part_id,
        sum(quantity) over (
           partition by supplier_id
           order by part_id
           rows between unbounded preceding and current row) as running_supplier_total,
        sum(quantity) over (
           order by supplier_id, part_id
           rows between unbounded preceding and current row) as running_overall_total
from    suppliers_parts
order by supplier_id,
         part_id;

 supplier_id | part_id | running_supplier_total | running_overall_total
-------------+---------+------------------------+-----------------------
 S1          | P1      |                    300 |                   300
 S1          | P2      |                    500 |                   500
 S1          | P3      |                    900 |                   900
 S1          | P4      |                   1100 |                  1100
 S1          | P5      |                   1200 |                  1200
 S1          | P6      |                   1300 |                  1300
 S2          | P1      |                    300 |                  1600
 S2          | P2      |                    700 |                  2000
 S3          | P2      |                    200 |                  2200
 S4          | P2      |                    200 |                  2400
 S4          | P4      |                    500 |                  2700
 S4          | P5      |                    900 |                  3100
(12 rows)

It should be noted that the sorting in the windows are independent of the sorting in the main query... but is a good idea to keep them consistent with each other or else the result set will be correct but look weird:

select  supplier_id,
        part_id,
        sum(quantity) over (
           partition by supplier_id
           order by part_id
           rows between unbounded preceding and current row) as running_supplier_total,
        sum(quantity) over (
           order by supplier_id, part_id
           rows between unbounded preceding and current row) as running_overall_total
from    suppliers_parts
order by quantity;

 supplier_id | part_id | running_supplier_total | running_overall_total
-------------+---------+------------------------+-----------------------
 S1          | P5      |                   1200 |                  1200
 S1          | P6      |                   1300 |                  1300
 S1          | P2      |                    500 |                   500
 S3          | P2      |                    200 |                  2200
 S4          | P2      |                    200 |                  2400
 S1          | P4      |                   1100 |                  1100
 S1          | P1      |                    300 |                   300
 S2          | P1      |                    300 |                  1600
 S4          | P4      |                    500 |                  2700
 S1          | P3      |                    900 |                   900
 S4          | P5      |                    900 |                  3100
 S2          | P2      |                    700 |                  2000
(12 rows)

How do I produce a histogram?

Let's say you want to visualize the data in a database table somewhow, not just simply display raw numeric values. You can use repeat to produce a string whose length is proportional to a corresponding value. Below we, select the total quantities of parts per supplier but in a histogram:

select  supplier_id,
        repeat('🔩', cast(sum(quantity)/100.0 as integer))
from    suppliers_parts
group by supplier_id
order by supplier_id;
danielle=# \g
 supplier_id |    repeat
-------------+---------------
 S1          | 🔩🔩🔩🔩🔩🔩🔩🔩🔩🔩🔩🔩🔩
 S2          | 🔩🔩🔩🔩🔩🔩🔩
 S3          | 🔩🔩
 S4          | 🔩🔩🔩🔩🔩🔩🔩🔩🔩
(4 rows)

How can I generate a recursive sequence?

There may be an instance, say in a mathematical context, that you need to generate a sequence of values based on a recursive formula. This is a rather contrived example but let's say we want to produce the Fibonacci sequence, f(n+1) = f(n) + f(n-1). We can set this up by using two columns instead of one to establish the base cases of f(0) and f(1), and then recurse on that one row:

with recursive fibonacci(n, f_n, f_n_1) as
       (select  0, 1, 1
        union all
        select  n+1, f_n_1, f_n+f_n_1
        from   fibonacci)
select  n, f_n
from    fibonacci
limit 10;

 n | f_n
---+-----
 0 |   1
 1 |   1
 2 |   2
 3 |   3
 4 |   5
 5 |   8
 6 |  13
 7 |  21
 8 |  34
 9 |  55
(10 rows)

Note that this statement gets more time-wise expensive the larger number of rows you request since there is no memoization and memory wise since more rows need to be generated the further into the sequence you want to go.

How do I visualize a hierarchy?

It's common in relational databases to store hierchical data in tables that are self-referential. Namely, each child record will have a reference to a parent record, and that record will have a reference to its own parent, and so on with a topmost record which will not have a parent ID.

The employees table is such a table and we may need to display the organizational hierarchy of all the people in it. One way to do that is to print employees names and indent them sufficiently to imply their depth from the top. We can use a recursive query to start from the topmost record, i.e., the president, and "move down" the hierarchy via the parent ID.

But in order to represent the depth in the organization, we need to also track a level value which will start at 0 at the top and will increment with each level down. In the outermost query, we use the level to indent the name by a multiple of four spaces:

with recursive organization
       (level, id, name, manager_id) as
       (select  0, id, name, cast(null as numeric)
        from    employees
        where   job = 'PRESIDENT'
        union all
        select  o.level+1, e.id, e.name, e.manager_id
        from    employees e
        join    organization o
        on      e.manager_id = o.id)
select  id, repeat(' ', level*4) || name as name
from    organization;

  id  |       name
------+-------------------
 7839 | KING
 7698 |     BLAKE
 7782 |     CLARK
 7566 |     JONES
 7788 |         SCOTT
 7902 |         FORD
 7499 |         ALLEN
 7521 |         WARD
 7654 |         MARTIN
 7844 |         TURNER
 7900 |         JAMES
 7934 |         MILLER
 7369 |             SMITH
 7876 |             ADAMS
(14 rows)

How about if we want to see the entire reporting line of each employee instead, not simply their depth or just each employee's manager? We can do that by building a string that represents the reporting line instead of a mere integer:

with recursive reporting_lines
       (id, name, manager_id, reporting_line) as
       (select  id, name, cast(null as numeric), null
        from    employees
        where   job = 'PRESIDENT'
        union all
        select  e.id, e.name, e.manager_id,
                case
                   when rl.reporting_line is null then rl.name
                   else rl.reporting_line || ' -> ' || rl.name
                end
        from    employees e
        join    reporting_lines rl
        on      e.manager_id = rl.id)
select  id, name, reporting_line
from    reporting_lines;

 id  |  name  |     reporting_line
------+--------+------------------------
 7839 | KING   |
 7698 | BLAKE  | KING
 7782 | CLARK  | KING
 7566 | JONES  | KING
 7788 | SCOTT  | KING -> JONES
 7902 | FORD   | KING -> JONES
 7499 | ALLEN  | KING -> BLAKE
 7521 | WARD   | KING -> BLAKE
 7654 | MARTIN | KING -> BLAKE
 7844 | TURNER | KING -> BLAKE
 7900 | JAMES  | KING -> BLAKE
 7934 | MILLER | KING -> CLARK
 7369 | SMITH  | KING -> JONES -> FORD
 7876 | ADAMS  | KING -> JONES -> SCOTT
(14 rows)

How do I get totals aggregating by multiple different columns simultaneously?

This sort of thing is a fairly common need for data analysts who need to be able to slice data in multiple ways. If you wanted to, say, get the sums of quantities of parts by both part number and supplier, you could just union two selects:

select  sum(quantity), supplier_id, null as part_id
from    suppliers_parts
group by supplier_id
union all 
select  sum(quantity), null, part_id
from    suppliers_parts
group by part_id;

 sum  | supplier_id | part_id
------+-------------+---------
 1300 | S1          |
  700 | S2          |
  900 | S4          |
  200 | S3          |
 1000 |             | P2
  600 |             | P1
  100 |             | P6
  500 |             | P4
  500 |             | P5
  400 |             | P3
(10 rows)

However, a pithier way of expressing this is to scan the table once and leverage grouping sets to specify all the possible columns by which you want to group by. Postgres will fill in nulls for the column that is irrelevant at any one time.

select  sum(quantity), supplier_id, part_id
from    suppliers_parts
group by grouping sets(supplier_id, part_id);

 sum  | supplier_id | part_id
------+-------------+---------
 1300 | S1          |
  700 | S2          |
  900 | S4          |
  200 | S3          |
 1000 |             | P2
  600 |             | P1
  100 |             | P6
  500 |             | P4
  500 |             | P5
  400 |             | P3
(10 rows)

Mutating records

CTAS

Merge statement or equivalent

Deduping records

Data dictionary

all columns

all tables

all constraints

all indices

create table suppliers
(id varchar(2),
name varchar(20) not null,
status numeric not null,
city varchar(20) not null,
primary key (id));
insert into suppliers
(id, name, status, city)
values ('S1', 'Smith', 20, 'London'),
('S2', 'Jones', 10, 'Paris'),
('S3', 'Blake', 30, 'Paris'),
('S4', 'Clark', 20, 'London'),
('S5', 'Adams', 30, 'Athens');
create table parts
(id varchar(2),
name varchar(20) not null,
color varchar(20) not null,
weight numeric not null,
city varchar(20) not null,
primary key (id));
insert into parts
(id, name, color, weight, city)
values ('P1', 'Nut', 'Red', 12.0, 'London'),
('P2', 'Bolt', 'Green', 17.0, 'Paris'),
('P3', 'Screw', 'Blue', 17.0, 'Oslo'),
('P4', 'Screw', 'Red', 14.0, 'London'),
('P5', 'Cam', 'Blue', 12.0, 'Paris'),
('P6', 'Cog', 'Red', 19.0, 'London');
create table suppliers_parts
(supplier_id varchar(2),
part_id varchar(2),
quantity numeric not null,
foreign key (supplier_id) references suppliers (id),
foreign key (part_id) references parts (id));
insert into suppliers_parts
(supplier_id, part_id, quantity)
values ('S1', 'P1', 300),
('S1', 'P2', 200),
('S1', 'P3', 400),
('S1', 'P4', 200),
('S1', 'P5', 100),
('S1', 'P6', 100),
('S2', 'P1', 300),
('S2', 'P2', 400),
('S3', 'P2', 200),
('S4', 'P2', 200),
('S4', 'P4', 300),
('S4', 'P5', 400);
create table departments
(id numeric,
name varchar(20) not null,
location varchar(20) not null,
primary key (id));
insert into departments
(id, name, location)
values (10, 'ACCOUNTING', 'NEW YORK'),
(20, 'RESEARCH', 'DALLAS'),
(30, 'SALES', 'CHICAGO'),
(40, 'OPERATIONS', 'BOSTON');
create table employees
(id numeric,
name varchar(10) not null,
job varchar(10) not null,
manager_id numeric,
hire_date date not null,
salary numeric not null,
commission numeric,
department_id numeric not null,
primary key (id),
foreign key (manager_id) references employees(id),
foreign key (department_id) references departments(id));
insert into employees
values (7839, 'KING', 'PRESIDENT', null, to_date('17-11-1981','dd-mm-yyyy'), 5000, null, 10),
(7698, 'BLAKE', 'MANAGER', 7839, to_date('1-5-1981','dd-mm-yyyy'), 2850, null, 30),
(7782, 'CLARK', 'MANAGER', 7839, to_date('9-6-1981','dd-mm-yyyy'), 2450, null, 10),
(7566, 'JONES', 'MANAGER', 7839, to_date('2-4-1981','dd-mm-yyyy'), 2975, null, 20),
(7788, 'SCOTT', 'ANALYST', 7566, to_date('13-6-1987','dd-mm-yyyy'), 3000, null, 20),
(7902, 'FORD', 'ANALYST', 7566, to_date('3-12-1981','dd-mm-yyyy'), 3000, null, 20),
(7369, 'SMITH', 'CLERK', 7902, to_date('17-12-1980','dd-mm-yyyy'), 800, null, 20),
(7499, 'ALLEN', 'SALESMAN', 7698, to_date('20-2-1981','dd-mm-yyyy'), 1600, 300, 30),
(7521, 'WARD', 'SALESMAN', 7698, to_date('22-2-1981','dd-mm-yyyy'), 1250, 500, 30),
(7654, 'MARTIN', 'SALESMAN', 7698, to_date('28-9-1981','dd-mm-yyyy'), 1250, 1400, 30),
(7844, 'TURNER', 'SALESMAN', 7698, to_date('8-9-1981','dd-mm-yyyy'), 1500, 0, 30),
(7876, 'ADAMS', 'CLERK', 7788, to_date('13-7-1987', 'dd-mm-yyyy'), 1100, null, 20),
(7900, 'JAMES', 'CLERK', 7698, to_date('3-12-1981','dd-mm-yyyy'), 950, null, 30),
(7934, 'MILLER', 'CLERK', 7782, to_date('23-1-1982','dd-mm-yyyy'), 1300, null, 10);
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment