Skip to content

Instantly share code, notes, and snippets.

@bgschiller
Last active January 31, 2019 03:23
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save bgschiller/610eb86d8da0e290b336 to your computer and use it in GitHub Desktop.
Save bgschiller/610eb86d8da0e290b336 to your computer and use it in GitHub Desktop.
SQL Join tutorial

Join and Window Function tutorials

These guides were originally prepared for internal training at TopOPPS.

Setup a database

You may be able to contact me to get credentials to access a running database. If not, follow these steps to get a database with the library schema.

1. Install PostgreSQL

(The best SQL database)

Mac:

Best bet is to use http://postgresapp.com/ .

Make sure to set up your path once it's installed:

  1. Run atom ~/.bash_profile.
  2. Add the line export PATH=$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin
  3. Save and close the file.

Windows:

Download the installer from http://www.enterprisedb.com/products-services-training/pgdownload#windows and run it. Update the PATH, as advised at https://stackoverflow.com/a/22921860/1586229. I don't have a Windows machine, so please comment here if you have more guidance.

Linux:

Use your distro's package manager to install postgres.

2. Create a database, and load the .pg_dump file

$ createdb library
$ gzip -d join_tutorial.pg_dump.gz
$ psql -d library < join_tutorial.pg_dump # This one will emit some errors like 'ERROR:  role "zampmgjjqmxrvg" does not exist', but you can safely ignore them
$ psql -d library -c 'SELECT * FROM loan LIMIT 5' # Check that everything worked okay
 id | book_copy_id |  date_out  |  date_due  | date_back  | borrower_id
----+--------------+------------+------------+------------+-------------
  1 |          525 | 2003-04-11 | 2003-05-02 | 2003-04-15 |          27
  2 |          462 | 2009-02-07 | 2009-02-28 | 2009-02-12 |          45
  4 |          573 | 2005-10-13 | 2005-11-03 | 2005-10-30 |          29
  5 |          112 | 2003-12-09 | 2003-12-30 | 2003-12-21 |          12
  6 |          496 | 2014-08-26 | 2014-09-16 | 2014-09-01 |          18
(5 rows)

Accessing the database

If you're hosting the db yourself, you'll want to enter the following connection info into your db client:

Host localhost
Port 5432
User <your computer username>
Password <empty>
Database library (or whatever you named it)

Mac:

I recommend Postico (made by the same people as postgresapp).

Windows and Linux:

Haven't explored much, but I've heard Navicat is good (paid). pgAdmin is popular and free, but (IMO) ugly. sqlectron is free and pretty, but it's an electron app (which might bother you). Let me know if there's an option here you recommend.

Other Resources

SELECT book.id, loan.borrower_id, COUNT(*) OVER (PARTITION BY book.id) AS total_borrows
FROM book
JOIN book_copy ON (book.id = book_copy.book_id)
JOIN loan ON (book_copy.id = loan.book_copy_id)

##SQL Joins

In the beginning, there were no joins. Well, there were joins, but they weren’t a part of the language; they existed in the way that people thought about writing their queries. But there was no JOIN keyword. Instead, people used what are now called 'implicit joins'. Even though we have explicit joins now, it’s instructive to look at what people did instead.

When you do SELECT * FROM author a WHERE a.name = 'J.K. Rowling', you can imagine that you're scanning the whole author table and selecting out only those rows where a.name is 'J.K. Rowling'. Your database promises to produce an answer that is equivalent to the one you would get if you imagine things working that way, even if it might take some shortcuts in computing it. For example, you might have an index on author.name, so the database knows exactly where to look to find the right rows.

When you run

SELECT * FROM author a, book b
WHERE b.year_published > 2007 AND a.zipcode = '63110'

imagine that you are taking every choice of author with every choice of book, then checking the where clause bits. In procedural code, this would looks like a nested for loop:

for a in authors:
    for b in books:
        if b.year_published > 2007 and a.zipcode == '63110':
            result_set.append((a, b))

This is called a CROSS JOIN. If it's not filtered, it produces size(left_table) * size(right_table) rows, one for every possible pair. All of our other joins are just going to be special WHERE clauses applied on top of a cross join.

More often, we're doing an INNER JOIN. This is the one people mean when they just say JOIN. It pairs up rows from the left and right tables, usually based on a foreign key. Something like this:

SELECT * FROM book b, author a
WHERE b.author_id = a.id

Again, you can think of this a looping over all possible pairs of (b, a), but in this case we only take the ones that make sense together. A cross join of book and author contains (for example) a row that puts Fahrenheit 451 together with 'Dr. Seuss'. But it's not clear what that's supposed to mean. In an inner join, we only have rows that match a book with the author of that very book. Using more modern syntax, we could write that as

SELECT *
FROM book b JOIN author a ON (b.author_id = a.id)

Something to note about an inner join is that rows don't have to appear. For example, Primary Colors was published anonymously, so its author_id field is NULL. There isn't an author with a NULL id, so it won't be able to find an author to pair with. It just won't show up anywhere in the table.

What if we wanted Primary Colors to show up in our result? We'd like to be able to say something like this

-- warning: this probably doesn't do what you want
SELECT * FROM book b, author a
WHERE b.author_id IS NULL or b.author_id = a.id

If you think back to our mental execution model, you might see the error here. We consider every possible pair (b, a) for inclusion, so Primary Colors gets paired with (for example) J.K. Rowling, Dr. Seuss, Ray Bradbury, ... In each case, it's author_id is still NULL, so that pairing will be included:

We could probably kludge together a query that did what we wanted without using the JOIN keyword, but we've reached about the limits of what implicit joins can do for us. Instead, let's write out what we want using explicit joins:

SELECT *
FROM book b LEFT OUTER JOIN author a ON (b.author_id = a.id)

Just as there's a LEFT OUTER JOIN, we also have a RIGHT OUTER JOIN.

In our context, we might use a RIGHT OUTER JOIN when we wanted to list even authors who don't have any books in our book table.

SELECT *
FROM book b RIGHT OUTER JOIN author a ON (b.author_id = a.id)

As you may have guessed, left and right outer joins are symmetric. Anytime you're doing x LEFT OUTER JOIN y it is equivalent to doing y RIGHT OUTER JOIN x. Most people tend to stick to left outer joins. (I just checked the TopOPPS codebase, and we have 17 left joins and only one right join).

Finally, there is the FULL OUTER JOIN. Every row from the left and right tables will be represented in the result, whether it has a mate from the other side or not.

SELECT *
FROM book b FULL OUTER JOIN author a ON (b.author_id = a.id)

Exercises

  1. Find the names of people are currently borrowing a book.
  2. Find the titles of books that were checked out on August 3, 2015 (You can get just write out '2015-08-03').
  3. Find the 25 books that were lent most recently (For this query, you'll need the ORDER BY and LIMIT words. For example: SELECT name FROM friends ORDER BY age LIMIT 5. You may want to do some googling to learn how they work. There's also the postgresql documentation).
  4. Find the names of borrowers who have ever borrowed a book by Phillip Pullman.
  5. Find borrowers who have a book written by Phillip Pullman currently on loan.

The rest of the exercises are more complicated. They involve GROUP BY and aggregation, which is a bit complicated. We can go over that on another day, or you can do some googling. The postgres tutorial site seems to have a good page on this idea.

  1. Find the top 10 most-checked-out books. What about currently-checked out?
  2. Find the borrowers with the most (in number) overdue books. What about by cost?
  3. Find the copy with the most use (most loans)
  4. List people who have read every Phillip Pullman book in the database.
  5. Find the borrowers who have read books that were also read by the people who have read every Phillip Pullman book in the database.
  6. Rank the borrowers by how many books they have in common with the borrower 'Williams Igo'. Note that a borrower may have checked out a book more than once. Only count distinct books.
  7. Score each pair of borrowers on their compatibility.
    1. Prevent borrowers from being compared with themselves.
    2. Make sure there's only row for each pair of people. ie: Lyndon Cammarata and Katheleen Baird should appear together in only one row, not one for "Lyndon vs Katheleen" and another for "Katheleen vs Lyndon".
  8. Tally the popularity of books published in each year between 1990 and 2000. Include a row for each year, even if the number of loans of books from that year is zero. (Hint: you may find a function called generate_series to be useful here)
  9. Find the biggest Dr. Seuss fan.

Window functions allow you to perform a calculation among some group of rows, and then each row gets to know about the result. This is hard to understand without an example so here we go. With a regular old group by, we can find the average cost of the copies of some books:

SELECT book_id, AVG(cost)
FROM book_copy
GROUP BY 1

But what if we wanted to know the copies of a book for which we overpaid. That is, not the most expensive books, but the copies of a book that were more than twice as expensive as the average copy of that same book. Here's an (in my opinion) inelegant way of doing this, with a join.

SELECT id, cost, cost::float / avg_cost AS normalized_cost
FROM book_copy JOIN (
    SELECT book_id, AVG(cost) AS avg_cost
    FROM book_copy
    GROUP BY 1
) avg_costs
ON (book_copy.book_id = avg_costs.book_id)
WHERE cost::float / avg_cost >= 2

Not that that's such a terrible query, but we're learning about window functions here. Let's do it with window functions. We'll build up to it in three steps. First, let's get a Listing of every book_copy compared to the average price of volumes in our library.

SELECT id, cost, AVG(cost) OVER () AS avg_volume_cost
FROM book_copy

It's the OVER () bit that tells us we're using a window function. A window function has

  • An aggregate. Here it's AVG(cost).
  • A partition (optional). We didn't use one.
  • An order (optional). We didn't use one.

We can get closer to the answer we want by adding the partition clause:

SELECT id, cost, AVG(cost) OVER (PARTITION BY book_id) AS avg_cost
FROM book_copy

The partition clause chops up the result set so that all the rows have the same value in each listed attribute (here just book_id). Our window function reads something like this in English, "Give me the average cost among all book_copys that have the same book_id as me".

And finally, the question we wanted to answer:

SELECT id, cost, normalized_cost
FROM (
    SELECT id, cost,
      cost::float / AVG(cost) OVER (PARTITION BY book_id)  AS normalized_cost
    FROM book_copy
) with_normalized_cost
WHERE normalized_cost >= 2

Hmm, that turns out not to be such a nice example. I forgot that window functions can't be used in the WHERE-clause of a query, so we have to do the window function calculation inside a subquery, then filter it afterwards. Oh well.

At TopOPPS, we commonly use window functions to query the event table. For instance when we're calculating the prior_value, we use something like this:

SELECT id, opportunity_id, created, event_value,
    lag(event_value) OVER (PARTITION BY opportunity_id ORDER BY created) AS prior_value
FROM api_event
WHERE organization_id = '00Do0000000JJfoEAG' AND event_type = 'expected_close_date'

Here, our window function reads "Give me the event_value from the row right before me (that's lag()), in the group that has the same opportunity_id as I do, ordered by created." Notice the NULL that sometimes appears in the results. That happens when there isn't a "row right before me". That is, this is the first with that opportunity_id, sorted by created.

Exercises

  1. Sometimes people use important papers as bookmarks and forget them there. Make a query listing who checked out each copy of a book immediately before the latest borrower.
  2. Find the books where one borrower consists of at least 30% of the loans on that book. If you need help getting started, this gets you some of the way there: exercise2_hint.sql
  3. Find the 10th borrower of each book (not book_copy). Hint: A list of some available window functions can be found at https://www.postgresql.org/docs/9.4/static/functions-window.html
@bgschiller
Copy link
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment