- Purpose
- Nota bene
- Example schema
- Selecting records
- How do I get a list of items and the sum of another column in a table?
- How do I concatenate a list of string values each relavant to a row?
- How do I get a list of items and filter the result set on some aggregate?
- How do I join two tables and make sure rows don't drop out?
- How do I join two tables and ensure I get all rows back if some rows from one are missing from the other and vice versa?
- How do I compare two tables/record sets and see which rows exist in one but not the either?
- How do I join a table to the results of another query?
- How do I compute multiple aggregations at the same time without counting rows more than once?
- How do get a list of rows for which a certain column value does not exist in a table?
- How do get a list of rows for which a certain ID does exist in another table?
- How do I select records based on ranking of a column value?
- How do I select records based on ranking of values that do not exist in the database?
- How do I pivot row values to column values?
- How do I generate a sort index per aggregate?
- How do I get a more accurate ranking if there are "ties"?
- How do I generate rows for which there are no values for a specific column?
- How do I not only sort records but see their previous or next values in the same row?
- How do I get a running total of one column?
- How do I produce a histogram?
- How can I generate a recursive sequence?
- How do I visualize a hierarchy?
- How do I get totals aggregating by multiple different columns simultaneously?
This purpose of this is to document all the tricks that I've learned over the last couple of decades (good grief, has it really been that long?) from interacting with and developing for databases.
I have organized this as an FAQ of sorts rather than by technical concept, e.g., scalar subuery, chasm trap, etc., because when you don't know how to express something in code, you can't know what the solution type is called or named. My other endeavor is to capture the more thornier types of queries that can come up as requirements, instead of just itemizing trivial, common examples.
This is still a work in progress! And suggestions or comments are welcome!
All of the SQL statements in this can be used against the canonical supplier/parts tables used in many of C.J. Date's books, https://wiki.c2.com/?SupplierPartsDatabase, as well as the prototypical departments/employees tables installed by default in Oracle databases, https://www.orafaq.com/wiki/SCOTT
The tables used are:
id name status city
----------------------------
S1 Smith 20 London
S2 Jones 10 Paris
S3 Blake 30 Paris
S4 Clark 20 London
S5 Adams 30 Athens
id name color weight city
----------------------------------
P1 Nut Red 12.0 London
P2 Bolt Green 17.0 Paris
P3 Screw Blue 17.0 Oslo
P4 Screw Red 14.0 London
P5 Cam Blue 12.0 Paris
P6 Cog Red 19.0 London
supplier_id part_id quantity
------------------------------
S1 P1 300
S1 P2 200
S1 P3 400
S1 P4 200
S1 P5 100
S1 P6 100
S2 P1 300
S2 P2 400
S3 P2 200
S4 P2 200
S4 P4 300
S4 P5 400
... and for the other tables:
id | name | location
----+------------+----------
10 | ACCOUNTING | NEW YORK
20 | RESEARCH | DALLAS
30 | SALES | CHICAGO
40 | OPERATIONS | BOSTON
(4 rows)
id | name | job | manager_id | hire_date | salary | commission | department_id
------+--------+-----------+------------+------------+--------+------------+---------------
7839 | KING | PRESIDENT | | 1981-11-17 | 5000 | | 10
7698 | BLAKE | MANAGER | 7839 | 1981-05-01 | 2850 | | 30
7782 | CLARK | MANAGER | 7839 | 1981-06-09 | 2450 | | 10
7566 | JONES | MANAGER | 7839 | 1981-04-02 | 2975 | | 20
7788 | SCOTT | ANALYST | 7566 | 1987-06-13 | 3000 | | 20
7902 | FORD | ANALYST | 7566 | 1981-12-03 | 3000 | | 20
7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800 | | 20
7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 | 1600 | 300 | 30
7521 | WARD | SALESMAN | 7698 | 1981-02-22 | 1250 | 500 | 30
7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 | 1250 | 1400 | 30
7844 | TURNER | SALESMAN | 7698 | 1981-09-08 | 1500 | 0 | 30
7876 | ADAMS | CLERK | 7788 | 1987-07-13 | 1100 | | 20
7900 | JAMES | CLERK | 7698 | 1981-12-03 | 950 | | 30
7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300 | | 10
(14 rows)
A SQL script is attached this gist to allow you to create the schema in your own database.
Let's say that we want to know which parts are available and the number of suppliers for each. In this case, we want to group by the part_id
, and simply count the number of supplier_id
s:
select part_id,
count(supplier_id)
from suppliers_parts
group by part_id;
part_id | count
---------+-------
P2 | 4
P1 | 2
P6 | 1
P4 | 2
P5 | 2
P3 | 1
(6 rows)
In this case, we still want to group by the part ID but we want to sum the quantity for each:
select part_id,
sum(quantity)
from suppliers_parts
group by part_id;
part_id | sum
---------+------
P2 | 1000
P1 | 600
P6 | 100
P4 | 500
P5 | 500
P3 | 400
(6 rows)
What if you wanted to only include those parts for which at least a certain about, say 500, were available? Then we can filter records out using the having
clause. You can think of it like where
: select
:: having
: group by
:
select part_id,
sum(quantity)
from suppliers_parts
group by part_id
having sum(quantity) > 500;
part_id | sum
---------+------
P2 | 1000
P1 | 600
(2 rows)
Let's say you wanted get a list of all the supplier names for each part. You can take advantage of string_agg
here just like you would any other SQL aggregate function like max
or sum
:
select sp.part_id,
string_agg(s.name, ', ') as supplier_list
from suppliers_parts sp
join suppliers s
on sp.supplier_id = s.id
group by sp.part_id;
part_id | supplier_list
---------+----------------------------
P2 | Smith, Jones, Blake, Clark
P1 | Smith, Jones
P6 | Smith
P4 | Smith, Clark
P5 | Smith, Clark
P3 | Smith
(6 rows)
In MySQL, you just need to use group_concat
:
select sp.part_id,
group_concat(s.name) as supplier_list
from suppliers_parts sp
join suppliers s
on sp.supplier_id = s.id
group by sp.part_id;
+---------+-------------------------+
| part_id | supplier_list |
+---------+-------------------------+
| P1 | Smith,Jones |
| P2 | Smith,Jones,Blake,Clark |
| P3 | Smith |
| P4 | Smith,Clark |
| P5 | Smith,Clark |
| P6 | Smith |
+---------+-------------------------+
6 rows in set (0.00 sec)
It's very common to need to join two or more tables but wind up dropping rows because certain rows from the first table don't have corresponding entries in the other. For instance, let's see we want to get a total count of quantities for all known parts, even if there are parts for which none is available. If we do this with a simple join
one of the suppliers drops out:
select s.id,
s.name,
sum(sp.quantity) as total_quantity
from suppliers s
join suppliers_parts sp
on s.id = sp.supplier_id
group by s.id,
s.name
order by s.id
id | name | total_quantity
----+-------+----------------
S1 | Smith | 1300
S2 | Jones | 700
S3 | Blake | 200
S4 | Clark | 900
(4 rows)
This is exactly what left outer join
is useful for:
select s.id,
s.name,
sum(sp.quantity) as total_quantity
from suppliers s
left outer join suppliers_parts sp
on s.id = sp.supplier_id
group by s.id,
s.name
order by s.id
id | name | total_quantity
----+-------+----------------
S1 | Smith | 1300
S2 | Jones | 700
S3 | Blake | 200
S4 | Clark | 900
S5 | Adams |
(5 rows)
How do I join two tables and ensure I get all rows back if some rows from one are missing from the other and vice versa?
This is a rather contrived example, but let's say we want to see more clearly which suppliers and parts are from the same city, and which suppliers lack an accompanying part and which parts don't have a supplier in the same city, all at once. For something like this, we can do a full outer join on the city
column:
select p.name,
p.city,
s.name,
s.city
from parts p
full outer join suppliers s
on p.city = s.city
name | city | name | city
-------+--------+-------+--------
Cog | London | Smith | London
Screw | London | Smith | London
Nut | London | Smith | London
Cam | Paris | Jones | Paris
Bolt | Paris | Jones | Paris
Cam | Paris | Blake | Paris
Bolt | Paris | Blake | Paris
Cog | London | Clark | London
Screw | London | Clark | London
Nut | London | Clark | London
| | Adams | Athens
Screw | Oslo | |
(12 rows)
You can clearly see that Adams has no part in Athens and the Screw has no supplier in Oslo.
One way to do this is indeed to use an exists
clause but if the two sets of records have the same structure, an easier way to express this is to use except
in PostgreSQL (or minus
in Oracle). Let's say we want to find out which cities making parts are not home to any suppliers?
select p.city
from parts p
except
select s.city
from suppliers s;
city
------
Oslo
(1 row)
You can easily answer the reverse question too:
select s.city
from suppliers s
except
select p.city
from parts p;
city
--------
Athens
(1 row)
Note that the shapes and types of the rows from each select
must be the same. You can also achieve this sort of thing using not exists
, but this pattern here is more commonly used with the two tables/record sets are similar in structure.
Oftentimes, you would like to join the results of one query to another table but can't do so easily. It is tempting to create a temporary table to do this sort fo thing, but you don't need to with the SQL in most databases. Instead you can use what is called an inline view to produce a "temporary" result set and then join to that.
In most cases, you don't really need an inline view but sometimes it's easier to express things in an inline view first rather writing one top level query to properly aggregrate everything all at once. Again, needing to do this is sometimes symptomatic of a non-optimal schema.
Let's say we want to first get the set of cities and number of parts made for each, and then merge that to the set of suppliers; we can do this:
select s.id,
s.name,
s.city,
c.part_count
from suppliers s
join (select p.city,
count(p.id) as part_count
from parts p
group by p.city) as c
on s.city = c.city
id | name | city | part_count
----+-------+--------+------------
S1 | Smith | London | 3
S2 | Jones | Paris | 2
S3 | Blake | Paris | 2
S4 | Clark | London | 3
(4 rows)
Note that one of the suppliers drops out above. One way around that is to use a so-called scalar subquery instead:
select s.id,
s.name,
s.city,
(select count(p.id)
from parts p
where p.city = s.city) as part_count
from suppliers s
id | name | city | count
----+-------+--------+-------
S1 | Smith | London | 3
S2 | Jones | Paris | 2
S3 | Blake | Paris | 2
S4 | Clark | London | 3
S5 | Adams | Athens | 0
(5 rows)
There are times which this is the easiest and most expressive way to obtain what you want but beware that this query is less performant as the scalar subquery is "executed" once per row in the outer query.
Let's say that ultimately we want a list of the suppliers, the count of parts available in the same city as each supplier, and the total quantities of all parts available for each supplier. We can first get suppliers and part counts with this query:
select s.id,
s.name,
s.city,
c.part_count
from suppliers s
join (select p.city,
count(p.id) as part_count
from parts p
group by p.city) as c
on s.city = c.city;
id | name | city | part_count
----+-------+--------+------------
S1 | Smith | London | 3
S2 | Jones | Paris | 2
S3 | Blake | Paris | 2
S4 | Clark | London | 3
(4 rows)
If we then add logic to get part quantities for each supplier we can arrive at this:
select s.id,
s.name,
count(p.id) as part_count,
sum(sp.quantity) as total_quantity
from suppliers s
join parts p
on s.city = p.city
join suppliers_parts sp
on s.id = sp.supplier_id
group by s.id,
s.name;
id | name | part_count | total_quantity
----+-------+------------+----------------
S3 | Blake | 2 | 400
S1 | Smith | 18 | 3900
S2 | Jones | 4 | 1400
S4 | Clark | 9 | 2700
(4 rows)
Uh oh... what happened? Why did the part counts suddenly go up? It's due to something called chasm trap, which is what happens when you join one table to two other tables and try aggregating on them both at the same time. By joining all three tables at onces like this, we effectively create a Cartesian product of records, thereby counting/summing column values more than once.
The way out of this problem is to inline each aggregation separately and then join them:
select pc.id,
pc.name,
pc.part_count,
tq.total_quantity
from (select s.id,
s.name,
count(p.id) as part_count
from suppliers s
join parts p
on s.city = p.city
group by s.id,
s.name) pc
join (select sp.supplier_id,
sum(sp.quantity) as total_quantity
from suppliers_parts sp
group by sp.supplier_id) as tq
on pc.id = tq.supplier_id;
id | name | part_count | total_quantity
----+-------+------------+----------------
S2 | Jones | 2 | 700
S1 | Smith | 3 | 1300
S3 | Blake | 2 | 200
S4 | Clark | 3 | 900
(4 rows)
At this point you may want to take advantage of a technique called subquery refactoring which allows you to move parts of the main query out into a with
clause
with part_counts as
(select s.id,
s.name,
count(p.id) as part_count
from suppliers s
join parts p
on s.city = p.city
group by s.id,
s.name),
total_quantities as
(select sp.supplier_id,
sum(sp.quantity) as total_quantity
from suppliers_parts sp
group by sp.supplier_id)
select pc.id,
pc.name,
pc.part_count,
tq.total_quantity
from part_counts pc
join total_quantities tq
on pc.id = tq.supplier_id;
This makes the main portion of the query more succuinct and elucidate what you actually want to do at the top level.
There's two ways to do this, one of which is more expressive of the other in my humble opinion. Let's say you wanted to figure out which of the suppliers has no parts? One way is to scan through the list of records in suppliers
and leverage a not exists
clause:
select s.id,
s.name
from suppliers s
where not exists
(select 'x'
from suppliers_parts sp
where sp.supplier_id = s.id);
id | name
----+-------
S5 | Adams
(1 row)
Note that the inner subquery just needs to return some value; it doesn't matter what it is. Another way is to do an outer join from suppliers
to suppliers_parts
and see which rows have a null
value for the part_id
:
select s.id,
s.name
from suppliers s
left outer join suppliers_parts sp
on s.id = sp.supplier_id
where sp.supplier_id is null;
id | name
----+-------
S5 | Adams
(1 row)
I feel like the first query has clearer intent.
There's again two ways to do this. Let's say you wanted to know which of the suppliers does offer parts? Likewise, one way is to scan through the list of records in suppliers
and leverage an exists
clause this time:
select s.id,
s.name
from suppliers s
where exists
(select 'x'
from suppliers_parts sp
where sp.supplier_id = s.id);
id | name
----+-------
S1 | Smith
S2 | Jones
S3 | Blake
S4 | Clark
(4 rows)
... and indeed we get the other four suppliers. You can also do this with an outer join from suppliers
to suppliers_parts
but this time you have to make sure you deduplicate the results:
select distinct s.id,
s.name
from suppliers s
join suppliers_parts sp
on s.id = sp.supplier_id;
id | name
----+-------
S3 | Blake
S4 | Clark
S2 | Jones
S1 | Smith
(4 rows)
Again, I feel like the first query has significantly clearer intent.
Oftentimes, legacy data are stored in tables that maintain their own history, either with a date column or sequence number column, and you need to be able to select only the records with the highest date or sequence number per some other colunn value.
We don't have a date column in this schema, so we'll manufacture an example here by asking: which of the suppliers_parts
records have the "maximum" value of the part ID per supplier ID. You could partially accomplish this just using max
and group by
except that selecting any other columns in the table will cause the query to bring back the maximum part ID per tuple of supplier ID and quantity:
select s.id,
max(sp1.part_id),
sp1.quantity
from suppliers s
join suppliers_parts sp1
on sp1.supplier_id = s.id
group by supplier_id,
quantity;
supplier_id | max | quantity
-------------+-----+----------
S2 | P1 | 300
S3 | P2 | 200
S1 | P6 | 100
S2 | P2 | 400
S4 | P5 | 400
S4 | P2 | 200
S1 | P1 | 300
S4 | P4 | 300
S1 | P4 | 200
S1 | P3 | 400
(10 rows)
What we really want to do is be able to pick only the records with the maximum part ID. We can do that with a subquery:
select s.id,
sp1.part_id,
sp1.quantity
from suppliers s
join suppliers_parts sp1
on sp1.supplier_id = s.id
where sp1.part_id =
(select max(sp2.part_id)
from suppliers_parts sp2
where sp2.supplier_id = sp1.supplier_id)
supplier_id | part_id | quantity
-------------+---------+----------
S1 | P6 | 100
S2 | P2 | 400
S3 | P2 | 200
S4 | P5 | 400
(4 rows)
But what if you wanted to also pick up records for which the supplier had no shipments? You can then utilize a left outer join
and checking for null
value on the joined column to ensure that you pick them up.
select s.id,
sp1.part_id,
sp1.quantity
from suppliers s
left outer join suppliers_parts sp1
on s.id = sp1.supplier_id
where sp1.supplier_id is null
or sp1.part_id =
(select max(sp2.part_id)
from suppliers_parts sp2
where sp2.supplier_id = sp1.supplier_id);
id | part_id | quantity
----+---------+----------
S1 | P6 | 100
S2 | P2 | 400
S3 | P2 | 200
S4 | P5 | 400
S5 | |
(5 rows)
Often you have to deal with legacy schemas which lack an explicit sorting or ranking column but there are column values on which you can infer/compute them. For example, let's say that we want to choose a supplier for each part based on a rank that's not directly available in any table. But we do have the names for each supplier and we know which ones are preferred over others by name. Imagine we have the following preference logic for suppliers: if Jones carries it, then choose that; if not and Clark has it then choose that, then Smith, then Adams, and then lastly Blake. We case exploit case
to set up a ranking score and then min
to select the best (in this case the minimum rank):
select sp.part_id,
sp.supplier_id,
s.name
from suppliers_parts sp
join suppliers s
on sp.supplier_id = s.id
where case
when s.name = 'Jones' then 1
when s.name = 'Clark' then 2
when s.name = 'Smith' then 3
when s.name = 'Adams' then 4
when s.name = 'Blake' then 5
end =
(select min(case
when s2.name = 'Jones' then 1
when s2.name = 'Clark' then 2
when s2.name = 'Smith' then 3
when s2.name = 'Adams' then 4
when s2.name = 'Blake' then 5
end)
from suppliers_parts sp2
join suppliers s2
on sp2.supplier_id = s2.id
where sp2.part_id = sp.part_id)
order by sp.part_id;
---------+-------------+-------
P1 | S2 | Jones
P2 | S2 | Jones
P3 | S1 | Smith
P4 | S4 | Clark
P5 | S4 | Clark
P6 | S1 | Smith
(6 rows)
If you look at all the rows in suppliers_parts
, you can verify that the suppliers listed above are the preferable ones according to the ranking of supplier names.
Let's say you wanted to tabulate which suppliers carried each part, with a column dedicated to each possible supplier. In this case, you want to effectively flip part of the suppliers_parts
table on its side, and select
ing each supplier:
select sp.part_id,
max(case when sp.supplier_id = 'S1' then 'X' else null end) as s1,
max(case when sp.supplier_id = 'S2' then 'X' else null end) as s2,
max(case when sp.supplier_id = 'S3' then 'X' else null end) as s3,
max(case when sp.supplier_id = 'S4' then 'X' else null end) as s4,
max(case when sp.supplier_id = 'S5' then 'X' else null end) as s5
from suppliers_parts sp
group by sp.part_id
order by sp.part_id;
part_id | s1 | s2 | s3 | s4 | s5
---------+----+----+----+----+----
P1 | X | X | | |
P2 | X | X | X | X |
P3 | X | | | |
P4 | X | | | X |
P5 | X | | | X |
P6 | X | | | |
(6 rows)
Let's say you wanted to get a list of all parts and suppliers, and wanted to sort and index each entry in suppliers_parts
by the supplier name. You can take advantage of windows functions again here
select sp.part_id,
p.name,
sp.supplier_id,
s.name,
row_number() over (partition by sp.part_id order by s.name) as sort_num
from suppliers_parts sp
join suppliers s
on sp.supplier_id = s.id
join parts p
on sp.part_id = p.id;
part_id | name | supplier_id | name | sort_num
---------+-------+-------------+-------+----------
P1 | Nut | S2 | Jones | 1
P1 | Nut | S1 | Smith | 2
P2 | Bolt | S3 | Blake | 1
P2 | Bolt | S4 | Clark | 2
P2 | Bolt | S2 | Jones | 3
P2 | Bolt | S1 | Smith | 4
P3 | Screw | S1 | Smith | 1
P4 | Screw | S4 | Clark | 1
P4 | Screw | S1 | Smith | 2
P5 | Cam | S4 | Clark | 1
P5 | Cam | S1 | Smith | 2
P6 | Cog | S1 | Smith | 1
(12 rows)
You can see above that the counts start over for each part ID.
There are times when row_number
is not sufficient to derive a proper rank; it's ok for sorting records but not good if you want to see when there are column values that repeat within a grouping. Consider the following:
select sp.supplier_id,
sp.part_id,
sp.quantity,
row_number() over (partition by sp.supplier_id order by sp.quantity desc) as sort_number
from suppliers_parts sp
supplier_id | part_id | quantity | sort_number
-------------+---------+----------+-------------
S1 | P3 | 400 | 1
S1 | P1 | 300 | 2
S1 | P4 | 200 | 3
S1 | P2 | 200 | 4
S1 | P5 | 100 | 5
S1 | P6 | 100 | 6
S2 | P2 | 400 | 1
S2 | P1 | 300 | 2
S3 | P2 | 200 | 1
S4 | P5 | 400 | 1
S4 | P4 | 300 | 2
S4 | P2 | 200 | 3
(12 rows)
For some of the suppliers above, there parts for which the quantities are the same and so you might want them ranked the same. If we want such duplicates taken into consideration, then we can use rank
instead:
select sp.supplier_id,
sp.part_id,
sp.quantity,
rank() over (partition by sp.supplier_id order by sp.quantity desc) as quantity_rank
from suppliers_parts sp;
supplier_id | part_id | quantity | quantity_rank
-------------+---------+----------+---------------
S1 | P3 | 400 | 1
S1 | P1 | 300 | 2
S1 | P4 | 200 | 3
S1 | P2 | 200 | 3
S1 | P5 | 100 | 5
S1 | P6 | 100 | 5
S2 | P2 | 400 | 1
S2 | P1 | 300 | 2
S3 | P2 | 200 | 1
S4 | P5 | 400 | 1
S4 | P4 | 300 | 2
S4 | P2 | 200 | 3
(12 rows)
We can further refine this query if we don't want to skip over ranks. That is, if for S1
above, we may want parts P5
, and P6
ranked 4 instead of 5. For that, we can use dense_rank
:
select sp.supplier_id,
sp.part_id,
sp.quantity,
dense_rank() over (partition by sp.supplier_id order by sp.quantity desc) as quantity_rank
from suppliers_parts sp;
supplier_id | part_id | quantity | quantity_rank
-------------+---------+----------+---------------
S1 | P3 | 400 | 1
S1 | P1 | 300 | 2
S1 | P4 | 200 | 3
S1 | P2 | 200 | 3
S1 | P5 | 100 | 4
S1 | P6 | 100 | 4
S2 | P2 | 400 | 1
S2 | P1 | 300 | 2
S3 | P2 | 200 | 1
S4 | P5 | 400 | 1
S4 | P4 | 300 | 2
S4 | P2 | 200 | 3
(12 rows)
There are times when you may want to display records corresponding to columns that have no rows in the database. For example, you may have to list sales or production data for a series of days or months, and for the instances that there are no such figures, you'd like to display a zero value nonetheless. It's situations like this that you'd like to somehow produce rows containing a range of values.
Different databases provide different means of doing this; in PostgreSQL, you can use the table function, generate_series
, to accomplish this. Since there are no data columns in our suppliers/parts schema, let's manufacture an example whereby we want to display the number of parts per weight (in this case from 10 to 20), but with the additional requirement that we want a zero count for weights that don't exist in the parts
table:
with all_weights(weight) as
(select *
from generate_series(10, 20))
select aw.weight,
count(p.id) as part_count
from all_weights aw
left outer join parts p
on p.weight = aw.weight
group by aw.weight
order by aw.weight;
weight | part_count
--------+------------
10 | 0
11 | 0
12 | 2
13 | 0
14 | 1
15 | 0
16 | 0
17 | 2
18 | 0
19 | 1
20 | 0
(11 rows)
There are times when it is really useful to be able to select current and previous values in the same row. Let's say we want to look at all suppliers and for each one show their respective quantities and parts, as well as ythe
select supplier_id,
part_id,
quantity,
lag(part_id) over (partition by supplier_id order by quantity) as previous_part_id,
lag(quantity) over (partition by supplier_id order by quantity) as previous_quantity
from suppliers_parts
order by supplier_id,
quantity;
supplier_id | part_id | quantity | previous_part_id | previous_quantity
-------------+---------+----------+------------------+-------------------
S1 | P5 | 100 | |
S1 | P6 | 100 | P5 | 100
S1 | P2 | 200 | P6 | 100
S1 | P4 | 200 | P2 | 200
S1 | P1 | 300 | P4 | 200
S1 | P3 | 400 | P1 | 300
S2 | P1 | 300 | |
S2 | P2 | 400 | P1 | 300
S3 | P2 | 200 | |
S4 | P2 | 200 | |
S4 | P4 | 300 | P2 | 200
S4 | P5 | 400 | P4 | 300
You can also get the next values within each row :
select supplier_id,
part_id,
quantity,
lead(part_id) over (partition by supplier_id order by quantity) as next_part_id,
lead(quantity) over (partition by supplier_id order by quantity) as next_quantity
from suppliers_parts
order by supplier_id,
quantity;
supplier_id | part_id | quantity | next_part_id | next_quantity
-------------+---------+----------+--------------+---------------
S1 | P5 | 100 | P6 | 100
S1 | P6 | 100 | P2 | 200
S1 | P2 | 200 | P4 | 200
S1 | P4 | 200 | P1 | 300
S1 | P1 | 300 | P3 | 400
S1 | P3 | 400 | |
S2 | P1 | 300 | P2 | 400
S2 | P2 | 400 | |
S3 | P2 | 200 | |
S4 | P2 | 200 | P4 | 300
S4 | P4 | 300 | P5 | 400
S4 | P5 | 400 | |
(12 rows)
Note that for the instances where there is no previous or next value, SQL will generate a null
.
Window functions come to our rescue again here; you can specify how specify which rows to consider for the window as well as how to order them. In the example below, we derive two running totals: one for each supplier sorted by part, and the other overall.
select supplier_id,
part_id,
sum(quantity) over (
partition by supplier_id
order by part_id
rows between unbounded preceding and current row) as running_supplier_total,
sum(quantity) over (
order by supplier_id, part_id
rows between unbounded preceding and current row) as running_overall_total
from suppliers_parts
order by supplier_id,
part_id;
supplier_id | part_id | running_supplier_total | running_overall_total
-------------+---------+------------------------+-----------------------
S1 | P1 | 300 | 300
S1 | P2 | 500 | 500
S1 | P3 | 900 | 900
S1 | P4 | 1100 | 1100
S1 | P5 | 1200 | 1200
S1 | P6 | 1300 | 1300
S2 | P1 | 300 | 1600
S2 | P2 | 700 | 2000
S3 | P2 | 200 | 2200
S4 | P2 | 200 | 2400
S4 | P4 | 500 | 2700
S4 | P5 | 900 | 3100
(12 rows)
It should be noted that the sorting in the windows are independent of the sorting in the main query... but is a good idea to keep them consistent with each other or else the result set will be correct but look weird:
select supplier_id,
part_id,
sum(quantity) over (
partition by supplier_id
order by part_id
rows between unbounded preceding and current row) as running_supplier_total,
sum(quantity) over (
order by supplier_id, part_id
rows between unbounded preceding and current row) as running_overall_total
from suppliers_parts
order by quantity;
supplier_id | part_id | running_supplier_total | running_overall_total
-------------+---------+------------------------+-----------------------
S1 | P5 | 1200 | 1200
S1 | P6 | 1300 | 1300
S1 | P2 | 500 | 500
S3 | P2 | 200 | 2200
S4 | P2 | 200 | 2400
S1 | P4 | 1100 | 1100
S1 | P1 | 300 | 300
S2 | P1 | 300 | 1600
S4 | P4 | 500 | 2700
S1 | P3 | 900 | 900
S4 | P5 | 900 | 3100
S2 | P2 | 700 | 2000
(12 rows)
Let's say you want to visualize the data in a database table somewhow, not just simply display raw numeric values. You can use repeat
to produce a string whose length is proportional to a corresponding value. Below we, select the total quantities of parts per supplier but in a histogram:
select supplier_id,
repeat('🔩', cast(sum(quantity)/100.0 as integer))
from suppliers_parts
group by supplier_id
order by supplier_id;
danielle=# \g
supplier_id | repeat
-------------+---------------
S1 | 🔩🔩🔩🔩🔩🔩🔩🔩🔩🔩🔩🔩🔩
S2 | 🔩🔩🔩🔩🔩🔩🔩
S3 | 🔩🔩
S4 | 🔩🔩🔩🔩🔩🔩🔩🔩🔩
(4 rows)
There may be an instance, say in a mathematical context, that you need to generate a sequence of values based on a recursive formula. This is a rather contrived example but let's say we want to produce the Fibonacci sequence, f(n+1) = f(n) + f(n-1). We can set this up by using two columns instead of one to establish the base cases of f(0) and f(1), and then recurse on that one row:
with recursive fibonacci(n, f_n, f_n_1) as
(select 0, 1, 1
union all
select n+1, f_n_1, f_n+f_n_1
from fibonacci)
select n, f_n
from fibonacci
limit 10;
n | f_n
---+-----
0 | 1
1 | 1
2 | 2
3 | 3
4 | 5
5 | 8
6 | 13
7 | 21
8 | 34
9 | 55
(10 rows)
Note that this statement gets more time-wise expensive the larger number of rows you request since there is no memoization and memory wise since more rows need to be generated the further into the sequence you want to go.
It's common in relational databases to store hierchical data in tables that are self-referential. Namely, each child record will have a reference to a parent record, and that record will have a reference to its own parent, and so on with a topmost record which will not have a parent ID.
The employees
table is such a table and we may need to display the organizational hierarchy of all the people in it. One way to do that is to print employees names and indent them sufficiently to imply their depth from the top. We can use a recursive query to start from the topmost record, i.e., the president, and "move down" the hierarchy via the parent ID.
But in order to represent the depth in the organization, we need to also track a level value which will start at 0 at the top and will increment with each level down. In the outermost query, we use the level to indent the name by a multiple of four spaces:
with recursive organization
(level, id, name, manager_id) as
(select 0, id, name, cast(null as numeric)
from employees
where job = 'PRESIDENT'
union all
select o.level+1, e.id, e.name, e.manager_id
from employees e
join organization o
on e.manager_id = o.id)
select id, repeat(' ', level*4) || name as name
from organization;
id | name
------+-------------------
7839 | KING
7698 | BLAKE
7782 | CLARK
7566 | JONES
7788 | SCOTT
7902 | FORD
7499 | ALLEN
7521 | WARD
7654 | MARTIN
7844 | TURNER
7900 | JAMES
7934 | MILLER
7369 | SMITH
7876 | ADAMS
(14 rows)
How about if we want to see the entire reporting line of each employee instead, not simply their depth or just each employee's manager? We can do that by building a string that represents the reporting line instead of a mere integer:
with recursive reporting_lines
(id, name, manager_id, reporting_line) as
(select id, name, cast(null as numeric), null
from employees
where job = 'PRESIDENT'
union all
select e.id, e.name, e.manager_id,
case
when rl.reporting_line is null then rl.name
else rl.reporting_line || ' -> ' || rl.name
end
from employees e
join reporting_lines rl
on e.manager_id = rl.id)
select id, name, reporting_line
from reporting_lines;
id | name | reporting_line
------+--------+------------------------
7839 | KING |
7698 | BLAKE | KING
7782 | CLARK | KING
7566 | JONES | KING
7788 | SCOTT | KING -> JONES
7902 | FORD | KING -> JONES
7499 | ALLEN | KING -> BLAKE
7521 | WARD | KING -> BLAKE
7654 | MARTIN | KING -> BLAKE
7844 | TURNER | KING -> BLAKE
7900 | JAMES | KING -> BLAKE
7934 | MILLER | KING -> CLARK
7369 | SMITH | KING -> JONES -> FORD
7876 | ADAMS | KING -> JONES -> SCOTT
(14 rows)
This sort of thing is a fairly common need for data analysts who need to be able to slice data in multiple ways. If you wanted to, say, get the sums of quantities of parts by both part number and supplier, you could just union two selects
:
select sum(quantity), supplier_id, null as part_id
from suppliers_parts
group by supplier_id
union all
select sum(quantity), null, part_id
from suppliers_parts
group by part_id;
sum | supplier_id | part_id
------+-------------+---------
1300 | S1 |
700 | S2 |
900 | S4 |
200 | S3 |
1000 | | P2
600 | | P1
100 | | P6
500 | | P4
500 | | P5
400 | | P3
(10 rows)
However, a pithier way of expressing this is to scan the table once and leverage grouping sets
to specify all the possible columns by which you want to group by. Postgres will fill in null
s for the column that is irrelevant at any one time.
select sum(quantity), supplier_id, part_id
from suppliers_parts
group by grouping sets(supplier_id, part_id);
sum | supplier_id | part_id
------+-------------+---------
1300 | S1 |
700 | S2 |
900 | S4 |
200 | S3 |
1000 | | P2
600 | | P1
100 | | P6
500 | | P4
500 | | P5
400 | | P3
(10 rows)