This exercise set has 9 problems, not including 3 example problems (2x, 5x, and 6x)
Email me at dun@stanford.edu with the subject:
padjo-2017 homework **sql-ex-3** **your_sunet_id**
With a Google Doc (attached or linked-to) in which each query has its own page.
The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:
http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip
Unlike the previous set of exercises, which practiced using the COUNT aggregation function to provide a single result value (i.e. the row count of the result set), these exercises use aggregation with the GROUP BY clause.
Instead of counting the number of results, we can get a count of results by group. Just like we can with pivot tables.
resolution |
---|
ARREST, BOOKED |
ARREST, CITED |
CLEARED-CONTACT JUVENILE FOR MORE INFO |
COMPLAINANT REFUSES TO PROSECUTE |
DISTRICT ATTORNEY REFUSES TO PROSECUTE |
EXCEPTIONAL CLEARANCE |
JUVENILE ADMONISHED |
JUVENILE BOOKED |
JUVENILE CITED |
JUVENILE DIVERTED |
LOCATED |
NONE |
NOT PROSECUTED |
PROSECUTED BY OUTSIDE AGENCY |
PROSECUTED FOR LESSER OFFENSE |
PSYCHOPATHIC CASE |
UNFOUNDED |
This is pretty straightforward; requires a "GROUP BY" and "ORDER BY" clauses
SELECT
resolution
FROM
incidents
GROUP BY
resolution
ORDER BY
resolution ASC;
resolution | total_count |
---|---|
ARREST, BOOKED | 174718 |
ARREST, CITED | 34180 |
CLEARED-CONTACT JUVENILE FOR MORE INFO | 362 |
COMPLAINANT REFUSES TO PROSECUTE | 4595 |
DISTRICT ATTORNEY REFUSES TO PROSECUTE | 1055 |
EXCEPTIONAL CLEARANCE | 2188 |
JUVENILE ADMONISHED | 581 |
JUVENILE BOOKED | 4781 |
JUVENILE CITED | 1740 |
JUVENILE DIVERTED | 190 |
LOCATED | 10411 |
NONE | 494882 |
NOT PROSECUTED | 504 |
PROSECUTED BY OUTSIDE AGENCY | 975 |
PROSECUTED FOR LESSER OFFENSE | 5 |
PSYCHOPATHIC CASE | 11563 |
UNFOUNDED | 8317 |
SELECT
resolution,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
resolution
ORDER BY
resolution ASC;
Same as previous question, just a different sort order, and a LIMIT clause.
resolution | total_count |
---|---|
NONE | 494882 |
ARREST, BOOKED | 174718 |
ARREST, CITED | 34180 |
PSYCHOPATHIC CASE | 11563 |
LOCATED | 10411 |
{:.table-sql} |
SELECT
resolution,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
resolution
ORDER BY
total_count DESC
LIMIT 5;
Pretty much the exact same thing as the previous question, with a few details changed. See if you can write this out from scratch:
category | total_count |
---|---|
LARCENY/THEFT | 187882 |
OTHER OFFENSES | 98907 |
NON-CRIMINAL | 94489 |
ASSAULT | 63874 |
VANDALISM | 38162 |
VEHICLE THEFT | 33896 |
WARRANTS | 33163 |
BURGLARY | 30241 |
SUSPICIOUS OCC | 28062 |
DRUG/NARCOTIC | 27125 |
SELECT
category,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
category
ORDER BY
total_count DESC
LIMIT 10;
Virtually the same as the prior question, except we're grouping by, and showing, two columns (besides the "total_count" aggregate column).
category | description | total_count |
---|---|---|
LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | 73239 |
NON-CRIMINAL | LOST PROPERTY | 23236 |
LARCENY/THEFT | PETTY THEFT OF PROPERTY | 22825 |
OTHER OFFENSES | DRIVERS LICENSE, SUSPENDED OR REVOKED | 22497 |
NON-CRIMINAL | AIDED CASE, MENTAL DISTURBED | 22228 |
VEHICLE THEFT | STOLEN AUTOMOBILE | 21383 |
ASSAULT | BATTERY | 19121 |
WARRANTS | WARRANT ARREST | 18235 |
LARCENY/THEFT | PETTY THEFT FROM LOCKED AUTO | 18046 |
VANDALISM | MALICIOUS MISCHIEF, VANDALISM | 16465 |
SELECT
category, description,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
category, description
ORDER BY
total_count DESC
LIMIT 10;
5x. (EXAMPLE) List all descriptions of incidents categorized as "MISSING PERSON", in order of frequency
This is virtually the same as the prior exercise, but I make this an example to show something subtle about the results.
This is what I want to see -- note that the "category" column is left out:
description | total_count |
---|---|
FOUND PERSON | 10981 |
MISSING ADULT | 7054 |
MISSING JUVENILE | 4555 |
If you piggyback from the answer in the previous exercise, you might come up with this:
SELECT
category, description,
COUNT(*) AS total_count
FROM
incidents
WHERE
category = 'MISSING PERSON'
GROUP BY
category, description
ORDER BY
total_count DESC;
Which results in this:
category | description | total_count |
---|---|---|
MISSING PERSON | FOUND PERSON | 10981 |
MISSING PERSON | MISSING ADULT | 7054 |
MISSING PERSON | MISSING JUVENILE | 4555 |
The "total_count" column ends up being exactly the same, so the calculations were the same. The only difference is in structure: your result has the "category" column and my expected result does not. Why do I prefer the latter?
Think about what is happening in the WHERE clause:
WHERE
category = 'MISSING PERSON'
By definition, all results are going to have the "category" of 'MISSING PERSON', i.e. every row will have 'MISSING PERSON' or else it wouldn't be included in the results. What's the purpose of doing a group count by "category" if we know there aren't different groups?
And since we aren't doing a GROUP BY that refers to the "category" column, best practice is to omit it from the SELECT clause too.
Note how this query omits mention of the "category" column for purposes of displaying (via SELECT) or grouping (via GROUP BY):
SELECT
description,
COUNT(*) AS total_count
FROM
incidents
WHERE
category = 'MISSING PERSON'
GROUP BY
description
ORDER BY
total_count DESC;
5. List all descriptions of incidents categorized as either "MISSING PERSON" or "TRESPASS", in order of frequency
This is a situation in which we do want to include "category" as it can take on more than one value (two, in this case).
Without the "category" column, we'd run into problems in which two "description" values had different categories and are meant to be treated differently. That isn't the case with the categories of "MISSING PERSON" and "TRESSPASS", but we have no idea the taxonomy used by the SFPD.
category | description | total_count |
---|---|---|
MISSING PERSON | FOUND PERSON | 10981 |
MISSING PERSON | MISSING ADULT | 7054 |
TRESPASS | TRESPASSING | 5430 |
MISSING PERSON | MISSING JUVENILE | 4555 |
TRESPASS | LODGING WITHOUT PERMISSION | 1288 |
TRESPASS | LOITERING WITHOUT LAWFUL BUSINESS WITH OWNER OR OCCUPANT | 30 |
TRESPASS | TRESPASSING ON RAILROAD TRAINS | 7 |
TRESPASS | DISASTER AREA, ENTERING OR REMAINING IN | 3 |
SELECT
category,
description,
COUNT(*) AS total_count
FROM
incidents
WHERE
category = 'MISSING PERSON'
OR category = 'TRESPASS'
GROUP BY
category, description
ORDER BY
total_count DESC;
As before, treat the "datetime" column as a string, and use the SUBSTR function to select the first 4 characters and effectively create a new column. Alias that column as "year" and sort by that alias.
year | total_count |
---|---|
2012 | 140847 |
2013 | 152806 |
2014 | 150128 |
2015 | 156530 |
2016 | 150736 |
SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
GROUP BY year
ORDER BY year ASC;
Sometimes it's interesting to count/group by a time cycle, such as by month (as opposed to year and month). This is the same as counting by year, except extracting a different part of the datetime string.
month | total_count |
---|---|
01 | 63726 |
02 | 58656 |
03 | 63582 |
04 | 61351 |
05 | 63459 |
06 | 60730 |
07 | 62828 |
08 | 65574 |
09 | 63561 |
10 | 66932 |
11 | 61149 |
12 | 59499 |
If you can't figure out the arguments for the SUBSTR function to excerpt that middle part of the datetime value, try it out iteratively:
SELECT
datetime,
SUBSTR(datetime, 1, 10)
FROM incidents
LIMIT 5;
datetime | SUBSTR(datetime, 1, 10) |
---|---|
2012-01-01 00:01:00 | 2012-01-01 |
2012-01-01 00:01:00 | 2012-01-01 |
2012-01-01 00:01:00 | 2012-01-01 |
2012-01-01 00:01:00 | 2012-01-01 |
2012-01-01 00:01:00 | 2012-01-01 |
Keep trying until you get what you want:
SELECT
datetime,
SUBSTR(datetime, 2, 7)
FROM incidents
LIMIT 5;
datetime | SUBSTR(datetime, 2, 7) |
---|---|
2012-01-01 00:01:00 | 012-01- |
2012-01-01 00:01:00 | 012-01- |
2012-01-01 00:01:00 | 012-01- |
2012-01-01 00:01:00 | 012-01- |
2012-01-01 00:01:00 | 012-01- |
SELECT
SUBSTR(datetime, 6, 2) AS month,
COUNT(*) AS total_count
FROM incidents
GROUP BY month
ORDER BY month ASC;
year | total_count |
---|---|
2012 | 6444 |
2013 | 6775 |
2014 | 5408 |
2015 | 4254 |
2016 | 4244 |
SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
WHERE category = 'DRUG/NARCOTIC'
GROUP BY year
ORDER BY year ASC;
Pretend that there are "marijuana"-related incidents that are not just in the category of "DRUG/NARCOTIC". Assume that "having anything to do with 'marijuana'" means that the word 'marijuana' appears somewhere in the record.
Read: Fuzzy matching in SQL, particularly Matching patterns with LIKE
year | total_count |
---|---|
2012 | 1395 |
2013 | 1371 |
2014 | 1191 |
2015 | 892 |
2016 | 740 |
SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
WHERE description LIKE '%marijuana%'
GROUP BY year
ORDER BY year ASC;
Answer is very similar to the previous one, but requires a WHERE condition to filter on the "category" and "description" columns.
year | total_count |
---|---|
2012 | 5049 |
2013 | 5404 |
2014 | 4217 |
2015 | 3362 |
2016 | 3504 |
{:.table-sql} |
SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
WHERE description NOT LIKE '%marijuana%'
AND category = 'DRUG/NARCOTIC'
GROUP BY year
ORDER BY year ASC;