This exercise set has 3 problems, not including 1 example problems, 1x.
Email me at dun@stanford.edu with the subject:
padjo-2017 homework **sql-ex-4** **your_sunet_id**
With a Google Doc (attached or linked-to) in which each query has its own page.
The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:
http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip
Related: Stack Overflow: Using group by and having clause
(warning: tortured analogy)
Understanding the HAVING clause doesn't require much more than understanding WHERE. However, the reason why HAVING exists is due to the internals (i.e. something we treat as a black box) of how SQL processes our query.
Imagine a table full of assorted fruit, bananas, oranges, kiwis, just all over the place.
If we wanted just the fruit that was bananas, we'd say something like:
Gather up the fruit that are bananas,
SELECT *
FROM fruits
WHERE type = 'banana';
What if we wanted to collect just the yellowish-fruits which as a group numbered at least 10 or more, i.e.
I want just the yellow fruits of which there are more than 10 of each on this table
In SQL, that query would look something like this:
SELECT type,
COUNT(*) AS quantity
FROM fruits
WHERE color = 'yellow'
GROUP BY type
HAVING quantity > 10;
In SQL, that quantity > 10
must be in the HAVING clause. Because the filtering conditions in the WHERE clause are done before the group-by-aggregation. The column "quantity" simply does not exist before the GROUP BY is done.
To go back to our table full of fruit analogy: it's possible to gather/discard fruit based on the color of 'yellow' before we start grouping the fruit into groups by type, i.e. 'lemon', 'bananas', etc. However, it is only that we've grouped the fruit together that we can count each type's quantity
category | total_count |
---|---|
LARCENY/THEFT | 187882 |
OTHER OFFENSES | 98907 |
NON-CRIMINAL | 94489 |
ASSAULT | 63874 |
Start off by doing a group count by grouping by the "category" column and doing a reverse sort of the count column (aliased as "total_count"):
SELECT
category,
COUNT(*) AS total_count
FROM
incidents
GROUP BY category
ORDER BY total_count DESC;
For the sake of brevity, here are the first 10 or so rows:
category | total_count |
---|---|
LARCENY/THEFT | 187882 |
OTHER OFFENSES | 98907 |
NON-CRIMINAL | 94489 |
ASSAULT | 63874 |
VANDALISM | 38162 |
VEHICLE THEFT | 33896 |
WARRANTS | 33163 |
BURGLARY | 30241 |
SUSPICIOUS OCC | 28062 |
DRUG/NARCOTIC | 27125 |
MISSING PERSON | 22590 |
The point of the HAVING clause is to filter on the aggregated column, i.e. the "total_count" column. The aggregate happens because of the GROUP BY clause, so HAVING always comes after GROUP BY. Otherwise, it's not much different than WHERE
SELECT
category,
COUNT(*) AS total_count
FROM
incidents
GROUP BY category
HAVING total_count > 50000
ORDER BY total_count DESC;
A common error is to try to use WHERE when only HAVING will do:
SELECT
category,
COUNT(*) AS total_count
FROM
incidents
WHERE total_count > 50000
GROUP BY category
ORDER BY total_count DESC;
You should get this error:
misuse of aggregate: COUNT()
Likewise, HAVING is easily confused with WHERE. The following query does a silly filter for rows in which "category" is equal to 'BURGLARY', i.e. all the results will be burglaries:
SELECT
category,
COUNT(*) AS total_count
FROM
incidents
GROUP BY category
HAVING category = 'BURGLARY'
ORDER BY total_count DESC;
However, this doesn't throw an error because HAVING works pretty much the same as WHERE:
category | total_count |
---|---|
BURGLARY | 30241 |
This is equally nonsensical without throwing any errors:
SELECT
category,
COUNT(*) AS total_count
FROM
incidents
WHERE category = 'BURGLARY'
GROUP BY category
HAVING category = 'BURGLARY'
ORDER BY total_count DESC;
It still works. But it is nonsensical in a redundant way, like saying:
I want to buy the cat with a hat that is a cat
To reduce confusion, best practice in terms of SQL style is to only do filtering on aggregate columns in the HAVING clause.
description | total_count |
---|---|
AIDED CASE, MENTAL DISTURBED | 22228 |
DRIVERS LICENSE, SUSPENDED OR REVOKED | 22497 |
GRAND THEFT FROM LOCKED AUTO | 73239 |
LOST PROPERTY | 23236 |
PETTY THEFT OF PROPERTY | 22825 |
STOLEN AUTOMOBILE | 21383 |
SELECT
description,
COUNT(*) AS total_count
FROM
incidents
GROUP BY description
HAVING total_count > 20000
ORDER BY description ASC;
2. For the year 2015, list the precincts that had more than 20,000 incidents, in order of most incidents
pd_district | total_count |
---|---|
SOUTHERN | 30095 |
NORTHERN | 20089 |
SELECT
pd_district,
COUNT(*) AS total_count
FROM
incidents
WHERE
SUBSTR(datetime, 1, 4) = '2015'
GROUP BY pd_district
HAVING total_count > 20000
ORDER BY total_count DESC;
3. List the top 5 descriptions that have more than 1,000 incidents for the years 2014 through 2016, sorted in order of shortest to longest description
This one is a bit of a trick question. Not much different from the previous questions, but uses a LENGTH function (to determine the length of a string) as a way to sort the results.
If you can get this, then you have a decent sense of how WHERE and GROUP BY are related but different...
description | total_count |
---|---|
BATTERY | 11786 |
AIDED CASE | 3185 |
CONSPIRACY | 1410 |
TRESPASSING | 3497 |
CASE CLOSURE | 2272 |
FOUND PERSON | 6613 |
STOLEN TRUCK | 4595 |
LOST PROPERTY | 13874 |
MISSING ADULT | 4407 |
FOUND PROPERTY | 9190 |
In case you need a reminder on boolean expressions:
- Conditional filtering with WHERE
- Fuzzy matching in SQL
- Using the WHERE clause to filter data in SQL
- More Boolean Expressions to Filter SQL Queries
- Using LIKE, IN, BETWEEN, and wildcards to match multiple values in SQL
SELECT
description,
COUNT(*) AS total_count
FROM
incidents
WHERE
SUBSTR(datetime, 1, 4) IN ('2014', '2015', '2016')
GROUP BY description
HAVING total_count > 1000
ORDER BY
LENGTH(description) ASC
LIMIT 10;