SQL practice with SFPD incidents and the HAVING clause (4 of 4)
This exercise set has 3 problems, not including 1 example problems, 1x.
Email me at email@example.com with the subject:
padjo-2017 homework **sql-ex-4** **your_sunet_id**
With a Google Doc (attached or linked-to) in which each query has its own page.
The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:
(warning: tortured analogy)
Understanding the HAVING clause doesn't require much more than understanding WHERE. However, the reason why HAVING exists is due to the internals (i.e. something we treat as a black box) of how SQL processes our query.
Imagine a table full of assorted fruit, bananas, oranges, kiwis, just all over the place.
If we wanted just the fruit that was bananas, we'd say something like:
Gather up the fruit that are bananas,
SELECT * FROM fruits WHERE type = 'banana';
What if we wanted to collect just the yellowish-fruits which as a group numbered at least 10 or more, i.e.
I want just the yellow fruits of which there are more than 10 of each on this table
In SQL, that query would look something like this:
SELECT type, COUNT(*) AS quantity FROM fruits WHERE color = 'yellow' GROUP BY type HAVING quantity > 10;
In SQL, that
quantity > 10 must be in the HAVING clause. Because the filtering conditions in the WHERE clause are done before the group-by-aggregation. The column "quantity" simply does not exist before the GROUP BY is done.
To go back to our table full of fruit analogy: it's possible to gather/discard fruit based on the color of 'yellow' before we start grouping the fruit into groups by type, i.e. 'lemon', 'bananas', etc. However, it is only that we've grouped the fruit together that we can count each type's quantity
1x. (EXAMPLE) List the categories in which the incident count is greater than 50,000
Start off by doing a group count by grouping by the "category" column and doing a reverse sort of the count column (aliased as "total_count"):
SELECT category, COUNT(*) AS total_count FROM incidents GROUP BY category ORDER BY total_count DESC;
For the sake of brevity, here are the first 10 or so rows:
The point of the HAVING clause is to filter on the aggregated column, i.e. the "total_count" column. The aggregate happens because of the GROUP BY clause, so HAVING always comes after GROUP BY. Otherwise, it's not much different than WHERE
SELECT category, COUNT(*) AS total_count FROM incidents GROUP BY category HAVING total_count > 50000 ORDER BY total_count DESC;
A common error is to try to use WHERE when only HAVING will do:
SELECT category, COUNT(*) AS total_count FROM incidents WHERE total_count > 50000 GROUP BY category ORDER BY total_count DESC;
You should get this error:
misuse of aggregate: COUNT()
Likewise, HAVING is easily confused with WHERE. The following query does a silly filter for rows in which "category" is equal to 'BURGLARY', i.e. all the results will be burglaries:
SELECT category, COUNT(*) AS total_count FROM incidents GROUP BY category HAVING category = 'BURGLARY' ORDER BY total_count DESC;
However, this doesn't throw an error because HAVING works pretty much the same as WHERE:
This is equally nonsensical without throwing any errors:
SELECT category, COUNT(*) AS total_count FROM incidents WHERE category = 'BURGLARY' GROUP BY category HAVING category = 'BURGLARY' ORDER BY total_count DESC;
It still works. But it is nonsensical in a redundant way, like saying:
I want to buy the cat with a hat that is a cat
To reduce confusion, best practice in terms of SQL style is to only do filtering on aggregate columns in the HAVING clause.
1. List in alphabetical order the descriptions that have more than 20,000 incidents
|AIDED CASE, MENTAL DISTURBED||22228|
|DRIVERS LICENSE, SUSPENDED OR REVOKED||22497|
|GRAND THEFT FROM LOCKED AUTO||73239|
|PETTY THEFT OF PROPERTY||22825|
SELECT description, COUNT(*) AS total_count FROM incidents GROUP BY description HAVING total_count > 20000 ORDER BY description ASC;
2. For the year 2015, list the precincts that had more than 20,000 incidents, in order of most incidents
SELECT pd_district, COUNT(*) AS total_count FROM incidents WHERE SUBSTR(datetime, 1, 4) = '2015' GROUP BY pd_district HAVING total_count > 20000 ORDER BY total_count DESC;
3. List the top 5 descriptions that have more than 1,000 incidents for the years 2014 through 2016, sorted in order of shortest to longest description
This one is a bit of a trick question. Not much different from the previous questions, but uses a LENGTH function (to determine the length of a string) as a way to sort the results.
If you can get this, then you have a decent sense of how WHERE and GROUP BY are related but different...
In case you need a reminder on boolean expressions:
- Conditional filtering with WHERE
- Fuzzy matching in SQL
- Using the WHERE clause to filter data in SQL
- More Boolean Expressions to Filter SQL Queries
- Using LIKE, IN, BETWEEN, and wildcards to match multiple values in SQL
SELECT description, COUNT(*) AS total_count FROM incidents WHERE SUBSTR(datetime, 1, 4) IN ('2014', '2015', '2016') GROUP BY description HAVING total_count > 1000 ORDER BY LENGTH(description) ASC LIMIT 10;