SQL practice with SFPD incidents and group by aggregation (3 of 4)
Directions
This exercise set has 9 problems, not including 3 example problems (2x, 5x, and 6x)
Email me at dun@stanford.edu with the subject:
padjo-2017 homework **sql-ex-3** **your_sunet_id**
With a Google Doc (attached or linked-to) in which each query has its own page.
The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:
http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip
Introduction
Unlike the previous set of exercises, which practiced using the COUNT aggregation function to provide a single result value (i.e. the row count of the result set), these exercises use aggregation with the GROUP BY clause.
Instead of counting the number of results, we can get a count of results by group. Just like we can with pivot tables.
Related reading
Exercises
1. Group by the resolution
field, list in alphabetical order
Expected result
resolution |
---|
ARREST, BOOKED |
ARREST, CITED |
CLEARED-CONTACT JUVENILE FOR MORE INFO |
COMPLAINANT REFUSES TO PROSECUTE |
DISTRICT ATTORNEY REFUSES TO PROSECUTE |
EXCEPTIONAL CLEARANCE |
JUVENILE ADMONISHED |
JUVENILE BOOKED |
JUVENILE CITED |
JUVENILE DIVERTED |
LOCATED |
NONE |
NOT PROSECUTED |
PROSECUTED BY OUTSIDE AGENCY |
PROSECUTED FOR LESSER OFFENSE |
PSYCHOPATHIC CASE |
UNFOUNDED |
Hints
This is pretty straightforward; requires a "GROUP BY" and "ORDER BY" clauses
Query
SELECT
resolution
FROM
incidents
GROUP BY
resolution
ORDER BY
resolution ASC;
2x. (EXAMPLE) Do a group count of the resolution
field, list resolutions in alphabetical order
Expected results
resolution | total_count |
---|---|
ARREST, BOOKED | 174718 |
ARREST, CITED | 34180 |
CLEARED-CONTACT JUVENILE FOR MORE INFO | 362 |
COMPLAINANT REFUSES TO PROSECUTE | 4595 |
DISTRICT ATTORNEY REFUSES TO PROSECUTE | 1055 |
EXCEPTIONAL CLEARANCE | 2188 |
JUVENILE ADMONISHED | 581 |
JUVENILE BOOKED | 4781 |
JUVENILE CITED | 1740 |
JUVENILE DIVERTED | 190 |
LOCATED | 10411 |
NONE | 494882 |
NOT PROSECUTED | 504 |
PROSECUTED BY OUTSIDE AGENCY | 975 |
PROSECUTED FOR LESSER OFFENSE | 5 |
PSYCHOPATHIC CASE | 11563 |
UNFOUNDED | 8317 |
Query
SELECT
resolution,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
resolution
ORDER BY
resolution ASC;
2. List the 5 most frequently-occurring values of the resolution
field
Same as previous question, just a different sort order, and a LIMIT clause.
Expected results
resolution | total_count |
---|---|
NONE | 494882 |
ARREST, BOOKED | 174718 |
ARREST, CITED | 34180 |
PSYCHOPATHIC CASE | 11563 |
LOCATED | 10411 |
{:.table-sql} |
Query
SELECT
resolution,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
resolution
ORDER BY
total_count DESC
LIMIT 5;
3. List the top 10 most frequently-occurring values of the category
field
Pretty much the exact same thing as the previous question, with a few details changed. See if you can write this out from scratch:
Expected results
category | total_count |
---|---|
LARCENY/THEFT | 187882 |
OTHER OFFENSES | 98907 |
NON-CRIMINAL | 94489 |
ASSAULT | 63874 |
VANDALISM | 38162 |
VEHICLE THEFT | 33896 |
WARRANTS | 33163 |
BURGLARY | 30241 |
SUSPICIOUS OCC | 28062 |
DRUG/NARCOTIC | 27125 |
Query
SELECT
category,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
category
ORDER BY
total_count DESC
LIMIT 10;
4. List the top 10 most frequently occurring category
and description
combinations
Virtually the same as the prior question, except we're grouping by, and showing, two columns (besides the "total_count" aggregate column).
Expected results
category | description | total_count |
---|---|---|
LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | 73239 |
NON-CRIMINAL | LOST PROPERTY | 23236 |
LARCENY/THEFT | PETTY THEFT OF PROPERTY | 22825 |
OTHER OFFENSES | DRIVERS LICENSE, SUSPENDED OR REVOKED | 22497 |
NON-CRIMINAL | AIDED CASE, MENTAL DISTURBED | 22228 |
VEHICLE THEFT | STOLEN AUTOMOBILE | 21383 |
ASSAULT | BATTERY | 19121 |
WARRANTS | WARRANT ARREST | 18235 |
LARCENY/THEFT | PETTY THEFT FROM LOCKED AUTO | 18046 |
VANDALISM | MALICIOUS MISCHIEF, VANDALISM | 16465 |
Query
SELECT
category, description,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
category, description
ORDER BY
total_count DESC
LIMIT 10;
5x. (EXAMPLE) List all descriptions of incidents categorized as "MISSING PERSON", in order of frequency
This is virtually the same as the prior exercise, but I make this an example to show something subtle about the results.
Expected results
This is what I want to see -- note that the "category" column is left out:
description | total_count |
---|---|
FOUND PERSON | 10981 |
MISSING ADULT | 7054 |
MISSING JUVENILE | 4555 |
Hints
If you piggyback from the answer in the previous exercise, you might come up with this:
SELECT
category, description,
COUNT(*) AS total_count
FROM
incidents
WHERE
category = 'MISSING PERSON'
GROUP BY
category, description
ORDER BY
total_count DESC;
Which results in this:
category | description | total_count |
---|---|---|
MISSING PERSON | FOUND PERSON | 10981 |
MISSING PERSON | MISSING ADULT | 7054 |
MISSING PERSON | MISSING JUVENILE | 4555 |
The "total_count" column ends up being exactly the same, so the calculations were the same. The only difference is in structure: your result has the "category" column and my expected result does not. Why do I prefer the latter?
Think about what is happening in the WHERE clause:
WHERE
category = 'MISSING PERSON'
By definition, all results are going to have the "category" of 'MISSING PERSON', i.e. every row will have 'MISSING PERSON' or else it wouldn't be included in the results. What's the purpose of doing a group count by "category" if we know there aren't different groups?
And since we aren't doing a GROUP BY that refers to the "category" column, best practice is to omit it from the SELECT clause too.
Query
Note how this query omits mention of the "category" column for purposes of displaying (via SELECT) or grouping (via GROUP BY):
SELECT
description,
COUNT(*) AS total_count
FROM
incidents
WHERE
category = 'MISSING PERSON'
GROUP BY
description
ORDER BY
total_count DESC;
5. List all descriptions of incidents categorized as either "MISSING PERSON" or "TRESPASS", in order of frequency
This is a situation in which we do want to include "category" as it can take on more than one value (two, in this case).
Expected results
Without the "category" column, we'd run into problems in which two "description" values had different categories and are meant to be treated differently. That isn't the case with the categories of "MISSING PERSON" and "TRESSPASS", but we have no idea the taxonomy used by the SFPD.
category | description | total_count |
---|---|---|
MISSING PERSON | FOUND PERSON | 10981 |
MISSING PERSON | MISSING ADULT | 7054 |
TRESPASS | TRESPASSING | 5430 |
MISSING PERSON | MISSING JUVENILE | 4555 |
TRESPASS | LODGING WITHOUT PERMISSION | 1288 |
TRESPASS | LOITERING WITHOUT LAWFUL BUSINESS WITH OWNER OR OCCUPANT | 30 |
TRESPASS | TRESPASSING ON RAILROAD TRAINS | 7 |
TRESPASS | DISASTER AREA, ENTERING OR REMAINING IN | 3 |
Query
SELECT
category,
description,
COUNT(*) AS total_count
FROM
incidents
WHERE
category = 'MISSING PERSON'
OR category = 'TRESPASS'
GROUP BY
category, description
ORDER BY
total_count DESC;
6x. (EXAMPLE) List total incidents per year
As before, treat the "datetime" column as a string, and use the SUBSTR function to select the first 4 characters and effectively create a new column. Alias that column as "year" and sort by that alias.
Expected results
year | total_count |
---|---|
2012 | 140847 |
2013 | 152806 |
2014 | 150128 |
2015 | 156530 |
2016 | 150736 |
Query
SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
GROUP BY year
ORDER BY year ASC;
6. List total incidents by month (ignoring year)
Sometimes it's interesting to count/group by a time cycle, such as by month (as opposed to year and month). This is the same as counting by year, except extracting a different part of the datetime string.
Expected results
month | total_count |
---|---|
01 | 63726 |
02 | 58656 |
03 | 63582 |
04 | 61351 |
05 | 63459 |
06 | 60730 |
07 | 62828 |
08 | 65574 |
09 | 63561 |
10 | 66932 |
11 | 61149 |
12 | 59499 |
Hints
If you can't figure out the arguments for the SUBSTR function to excerpt that middle part of the datetime value, try it out iteratively:
SELECT
datetime,
SUBSTR(datetime, 1, 10)
FROM incidents
LIMIT 5;
datetime | SUBSTR(datetime, 1, 10) |
---|---|
2012-01-01 00:01:00 | 2012-01-01 |
2012-01-01 00:01:00 | 2012-01-01 |
2012-01-01 00:01:00 | 2012-01-01 |
2012-01-01 00:01:00 | 2012-01-01 |
2012-01-01 00:01:00 | 2012-01-01 |
Keep trying until you get what you want:
SELECT
datetime,
SUBSTR(datetime, 2, 7)
FROM incidents
LIMIT 5;
datetime | SUBSTR(datetime, 2, 7) |
---|---|
2012-01-01 00:01:00 | 012-01- |
2012-01-01 00:01:00 | 012-01- |
2012-01-01 00:01:00 | 012-01- |
2012-01-01 00:01:00 | 012-01- |
2012-01-01 00:01:00 | 012-01- |
Query
SELECT
SUBSTR(datetime, 6, 2) AS month,
COUNT(*) AS total_count
FROM incidents
GROUP BY month
ORDER BY month ASC;
7. Yearly count of incidents categorized as "DRUG/NARCOTIC"
Expected result
year | total_count |
---|---|
2012 | 6444 |
2013 | 6775 |
2014 | 5408 |
2015 | 4254 |
2016 | 4244 |
Query
SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
WHERE category = 'DRUG/NARCOTIC'
GROUP BY year
ORDER BY year ASC;
8. Yearly count of incidents having anything to do with "marijuana"
Pretend that there are "marijuana"-related incidents that are not just in the category of "DRUG/NARCOTIC". Assume that "having anything to do with 'marijuana'" means that the word 'marijuana' appears somewhere in the record.
Read: Fuzzy matching in SQL, particularly Matching patterns with LIKE
Expected results
year | total_count |
---|---|
2012 | 1395 |
2013 | 1371 |
2014 | 1191 |
2015 | 892 |
2016 | 740 |
Query
SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
WHERE description LIKE '%marijuana%'
GROUP BY year
ORDER BY year ASC;
9. Yearly count of "DRUG/NARCOTIC" incidents that do not mention marijuana
Answer is very similar to the previous one, but requires a WHERE condition to filter on the "category" and "description" columns.
Expected results
year | total_count |
---|---|
2012 | 5049 |
2013 | 5404 |
2014 | 4217 |
2015 | 3362 |
2016 | 3504 |
{:.table-sql} |
Query
SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
WHERE description NOT LIKE '%marijuana%'
AND category = 'DRUG/NARCOTIC'
GROUP BY year
ORDER BY year ASC;