On the most basic level, the way we learned to query through a table in one of our databases is pretty simple.
The SQL SELECT
statement allows us to search through a table and return a custom results table with whatever columns/groupings/operations we specified in our query.
SELECT * FROM movies WHERE rottentomatoes > 85 AND lead_actor like 'Kevin Spacey';
Unfortunately, sometimes the queries need an added level of complexity and a simple statement like the one above wont do the trick. Here are two scenarios where we need to add some intricacy to our SQL statement in order to perform a sufficient query:
- We want to query for consistencies between two separate tables that are connected through a reference key.
- We want to query for AND consolidate ALL the references some 'parent' table has to all its children: whether its through a foreign key table reference or through an 'xref' table of multiple item references.
Hopefully by the end of this wiki page, you should have a better understanding on what the above two statements mean and how to implement them in a single SQL query statement.
SQL JOIN's are used to combine rows from 2+ tables based on a common field between them.
Below is an image that does an excellent job representing all of the different SQL joins and how to implement them (credit to www.codeproject.com).
I didn't get this at first either, but looking at a real example helped me understand exactly what the different statements mean.
Table A: ga_students_names
ida | last | first |
---|---|---|
1 | Seminara | Jagermeister |
3 | Hart | Colossus |
4 | Petrov | Petunia |
Table B: ages_residencies_of_ga_students
idb | age | residency |
---|---|---|
1 | 59 | Battery Park |
2 | 12 | Lenox Hill |
4 | 59 | GA 4th Floor |
- Gives the intersection of the two tables, or the rows they have common where Table A reference = Table B reference.
SELECT gsn.first, gsn.last, arga.age, arga.residency
FROM ga_students_names as gsn
INNER JOIN ages_residencies_of_ga_students as arga
ON gsn.ida = arga.idb;
This query will return all of the matching items in the SELECT
statement where ida and idb are present in BOTH tables.
first | last | age | residency |
---|---|---|---|
Jagermeister | Seminara | 59 | Battery Park |
Petunia | Petrov | 59 | GA 4th Floor |
The best way to vizualize exactly how JOINs work is to see what would happen if we alter the SELECT
statement to a *
query:
SELECT *
FROM ga_students_names as gsn
INNER JOIN ages_residencies_of_ga_students as arga
ON gsn.ida = arga.idb;
Which will return the table
ida | last | first | idb | age | residency |
---|---|---|---|---|---|
1 | Seminara | Jagermeister | 1 | 59 | Battery Park |
4 | Petrov | Petunia | 4 | 59 | GA 4th Floor |
As you can see, the statement INHERINTLY joins the rows with shared ID's, so all you have to do is:
- specify the columns you want
- refer to each desired column by calling its table (object-like syntax)
- specify the order that you want the items in!
- LEFT: Gives ALL selected rows from the LEFT table in the
ON
statement and any common selected rows from the RIGHT table.
LEFT and RIGHT joins solely differ from each other by the ordering of the syntax in the ON
section of the query statement.
For these statements, unlike INNER JOIN, the ORDER MATTERS!
Lets take the same query above and change INNER
to LEFT
and see the differences:
SELECT gsn.first, gsn.last, arga.age, arga.residency
FROM ga_students_names as gsn
LEFT JOIN ages_residencies_of_ga_students as arga
ON gsn.ida = arga.idb;
(left: gsn.id) = (right: gsn.idb)
first | last | age | residency |
---|---|---|---|
Jagermeister | Seminara | 59 | Battery Park |
Petunia | Petrov | 59 | GA 4th Floor |
Colossus | Hart | ||
As you can see, LEFT JOIN operates in two steps: |
- Ignore the join statement, print all lines in the LEFT table.
- Enter RIGHT table:
- if the
ON
statement is matched at any row, append those rows with data - if not, append NULL to the rest of the columns
- RIGHT: Gives ALL selected rows from the RIGHT table in the
ON
statement and any common selected rows from the LEFT table.
SELECT gsn.first, gsn.last, arga.age, arga.residency
FROM ga_students_names as gsn
RIGHT JOIN ages_residencies_of_ga_students as arga
ON gsn.ida = arga.idb;
(left: gsn.id) = (right: gsn.idb)
first | last | age | residency |
---|---|---|---|
Jagermeister | Seminara | 59 | Battery Park |
Petunia | Petrov | 59 | GA 4th Floor |
12 | Lenox Hill |
As you can see, RIGHT JOIN operates the same way as LEFT JOIN, but reverses the order of the tables in the ON
statement.
- A lot less useful than the above, simply returns a table that is a full representation of every line of both table and its null elements.
For this example, the outcome would be:
first | last | age | residency |
---|---|---|---|
Jagermeister | Seminara | 59 | Battery Park |
Petunia | Petrov | 59 | GA 4th Floor |
12 | Lenox Hill | ||
Colossus | Hart |
We can add a lot of functionality to our join statements by doing several things:
- modifying the
ON
statement withAND
clauses in between- ex:
ON tableA.id = tableB.id and tableA.age > 50
- ex:
- any other select query manipulations
SUM
,PROD
,GROUP BY
,ORDER BY
...
Once tables are JOINED, it makes it easy to query for things that have common references in two or more tables!
In one of our class examples we had three tables in our 'carmen' database: country, city, and countrylanguage.
Each had a reference key (code, countrycode, countrycode) that allowed us to JOIN the different tables!
We had to find the percentage of people in the country where the city 'Oldenberg' is located that speak the country's official language. Now it doesnt seem that hard, does it?
First, we need to find out what we want to display in our table (and what we want to display it AS
).
In the select statement we can just do a little math and call from our tables, knowing that we will JOIN them later:
> SELECT ROUND(country.population*(countrylanguage.percentage/100)) as native_speakers, countrylanguage.language as language
Then we need to specify a table we are going to use in the JOIN.
Pick a table you want to initiate the SELECT
from. This table will be part of the ON
statement:
> FROM country
Pick one of the two tables we want to JOIN on, and pick your JOIN ROUTE (LEFT works great here!):
> LEFT JOIN city
Add the ON
statement, initiating the JOIN:
> ON country.code = city.countrycode
CHAINING TIME! It's easy, just add another JOIN / ON statement. :
> LEFT JOIN countrylanguage
> ON city.country
Last step - specify the question parameters in a WHERE
statement from ANY of the linked tables!
WHERE city.name LIKE 'Oldenberg' and countrylanguage.isofficial
Sometimes chaining join statements wont be enough to get the outcome that you want.
We might have one ID that corresponds to several ID's from another table in a cross-reference (xref) table.
How can we output a table that shows one ID, and ALL of its other ID references in another column?
You are tasked to create a database for burger orders.
- Each burger has 1 meat type.
- Each burger has 1 bun type.
- Each burger can have ANY amount of cheeses.
- Each burger can have ANY amount of toppings.
Our ERD should look something like this:
How can we write 1 query statement to pull from orders, bread, meat, and also shows ALL of the corresponding cheeses and toppings in the SAME row?
We have to use an aggregate function called array_agg()
in our statement.
- You can find information on aggregate postgreSQL functions at http://www.postgresql.org/docs/9.4/static/functions-aggregate.html
The array_agg
function is surprisingly easy to implement, just use it in your select statement in the table that has multiple many-to-one values (cheese, toppings).
Lets do this! The key in this scenario is to first pick a table you need to aggregate - well pick cheese first (remember any value not in the statement must be renamed as omething else in the statement).
What do we want to output?
SELECT
o.orderid,
meat.type as meat_type,
bun.type as bun_type,
array_agg(c.type) as cheeses
from orders o
Lets do an INNER JOIN
on the meat, because there will always be a matching meat:
INNER JOIN meat
ON o.meatid = meat.meatid
LEFT JOIN
orders-buns, there might be null values for buns:
LEFT JOIN bun
ON o.bunid = bun.bunid
We only care about the joined cheeses/orders, so lets INNER JOIN
the xref table:
INNER JOIN order_cheese oc
ON o.orderid = oc.orderid
Lastly we need to connect the cheese table and our cheese xref, we want to include null cheese values:
LEFT JOIN cheese c
ON c.cheeseid = oc.cheeseid
GROUP BY (o.orderid, meat.type, bun.type)
ORDER BY o.orderid
This full query will output an aggregated table for the cheeses:
orderid | meat_type | bun_type | cheeses |
---|---|---|---|
1 | Bison | Brioche | {Muenster} |
2 | Angus | Focacchia | {Swiss,Blue} |
How do we add ANOTHER row of aggregated results?
This is where it gets complicated, but if you followed along this far it's not that bad.
We need to make a select statement that includes toppings, but our entire FROM
table must be the result of the full query above.
- We need to rename the entire first query and use that query as our
FROM
table (essentially nesting our first table inside the search for our second one) - Once we call from that table, all we need to do is sort by the table itself, and voila!
select
orderid,
meat_type,
bun_type,
cheeses,
array_agg(t.type) as toppings
from (
select
o.orderid,
meat.type as meat_type,
bun.type as bun_type,
array_agg(c.type) as cheeses
from orders o
-- get the meats using meatid on both sides
inner join meat on o.meatid = meat.meatid
-- get the buns using bunid on both sides
left join bun on o.bunid = bun.bunid
-- use an inner join here because we only care about OUR orders/cheeses
inner join order_cheese oc on o.orderid = oc.orderid
-- left join here because we want to capture orders w/o cheese
left join cheese c on c.cheeseid = oc.cheeseid
group by (o.orderid, meat.type, bun.type)
order by o.orderid
) as justCheese
-- use an inner join here because we only care about OUR orders/cheeses
inner join order_topping ot using(orderid)
-- left join here because we want to capture orders w/o topping
left join toppings t using (toppingid)
group by orderid, meat_type, bun_type, cheeses
order by justCheese.orderid
This will result in this beatiful table, which can be called and reference in your query's results.rows!