Skip to content

Instantly share code, notes, and snippets.

@dannguyen
Created October 31, 2017 19:30
Show Gist options
  • Save dannguyen/bda87523cb69e02865d457bedc07ebd0 to your computer and use it in GitHub Desktop.
Save dannguyen/bda87523cb69e02865d457bedc07ebd0 to your computer and use it in GitHub Desktop.

SQL practice with SFPD incidents and group by aggregation (3 of 4)

Directions

This exercise set has 9 problems, not including 3 example problems (2x, 5x, and 6x)

Email me at dun@stanford.edu with the subject:

padjo-2017 homework **sql-ex-3** **your_sunet_id**

With a Google Doc (attached or linked-to) in which each query has its own page.

The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:

http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip

Introduction

Unlike the previous set of exercises, which practiced using the COUNT aggregation function to provide a single result value (i.e. the row count of the result set), these exercises use aggregation with the GROUP BY clause.

Instead of counting the number of results, we can get a count of results by group. Just like we can with pivot tables.

Related reading

Exercises

1. Group by the resolution field, list in alphabetical order

Expected result

resolution
ARREST, BOOKED
ARREST, CITED
CLEARED-CONTACT JUVENILE FOR MORE INFO
COMPLAINANT REFUSES TO PROSECUTE
DISTRICT ATTORNEY REFUSES TO PROSECUTE
EXCEPTIONAL CLEARANCE
JUVENILE ADMONISHED
JUVENILE BOOKED
JUVENILE CITED
JUVENILE DIVERTED
LOCATED
NONE
NOT PROSECUTED
PROSECUTED BY OUTSIDE AGENCY
PROSECUTED FOR LESSER OFFENSE
PSYCHOPATHIC CASE
UNFOUNDED

Hints

This is pretty straightforward; requires a "GROUP BY" and "ORDER BY" clauses

Query

SELECT 
  resolution
FROM
  incidents
GROUP BY 
  resolution
ORDER BY 
  resolution ASC;

2x. (EXAMPLE) Do a group count of the resolution field, list resolutions in alphabetical order

Expected results

resolution total_count
ARREST, BOOKED 174718
ARREST, CITED 34180
CLEARED-CONTACT JUVENILE FOR MORE INFO 362
COMPLAINANT REFUSES TO PROSECUTE 4595
DISTRICT ATTORNEY REFUSES TO PROSECUTE 1055
EXCEPTIONAL CLEARANCE 2188
JUVENILE ADMONISHED 581
JUVENILE BOOKED 4781
JUVENILE CITED 1740
JUVENILE DIVERTED 190
LOCATED 10411
NONE 494882
NOT PROSECUTED 504
PROSECUTED BY OUTSIDE AGENCY 975
PROSECUTED FOR LESSER OFFENSE 5
PSYCHOPATHIC CASE 11563
UNFOUNDED 8317

Query

SELECT 
  resolution,
  COUNT(*) AS total_count
FROM
  incidents
GROUP BY 
  resolution
ORDER BY 
  resolution ASC;

2. List the 5 most frequently-occurring values of the resolution field

Same as previous question, just a different sort order, and a LIMIT clause.

Expected results

resolution total_count
NONE 494882
ARREST, BOOKED 174718
ARREST, CITED 34180
PSYCHOPATHIC CASE 11563
LOCATED 10411
{:.table-sql}

Query

SELECT 
  resolution,
  COUNT(*) AS total_count
FROM
  incidents
GROUP BY 
  resolution
ORDER BY 
  total_count DESC
LIMIT 5;

3. List the top 10 most frequently-occurring values of the category field

Pretty much the exact same thing as the previous question, with a few details changed. See if you can write this out from scratch:

Expected results

category total_count
LARCENY/THEFT 187882
OTHER OFFENSES 98907
NON-CRIMINAL 94489
ASSAULT 63874
VANDALISM 38162
VEHICLE THEFT 33896
WARRANTS 33163
BURGLARY 30241
SUSPICIOUS OCC 28062
DRUG/NARCOTIC 27125

Query

SELECT 
  category,
  COUNT(*) AS total_count
FROM
  incidents
GROUP BY 
  category
ORDER BY 
  total_count DESC
LIMIT 10;

4. List the top 10 most frequently occurring category and description combinations

Virtually the same as the prior question, except we're grouping by, and showing, two columns (besides the "total_count" aggregate column).

Expected results

category description total_count
LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO 73239
NON-CRIMINAL LOST PROPERTY 23236
LARCENY/THEFT PETTY THEFT OF PROPERTY 22825
OTHER OFFENSES DRIVERS LICENSE, SUSPENDED OR REVOKED 22497
NON-CRIMINAL AIDED CASE, MENTAL DISTURBED 22228
VEHICLE THEFT STOLEN AUTOMOBILE 21383
ASSAULT BATTERY 19121
WARRANTS WARRANT ARREST 18235
LARCENY/THEFT PETTY THEFT FROM LOCKED AUTO 18046
VANDALISM MALICIOUS MISCHIEF, VANDALISM 16465

Query

SELECT 
  category, description,
  COUNT(*) AS total_count
FROM
  incidents
GROUP BY 
  category, description
ORDER BY 
  total_count DESC
LIMIT 10;

5x. (EXAMPLE) List all descriptions of incidents categorized as "MISSING PERSON", in order of frequency

This is virtually the same as the prior exercise, but I make this an example to show something subtle about the results.

Expected results

This is what I want to see -- note that the "category" column is left out:

description total_count
FOUND PERSON 10981
MISSING ADULT 7054
MISSING JUVENILE 4555

Hints

If you piggyback from the answer in the previous exercise, you might come up with this:

SELECT 
  category, description,
  COUNT(*) AS total_count
FROM
  incidents
WHERE 
  category = 'MISSING PERSON'
GROUP BY 
  category, description
ORDER BY 
  total_count DESC;

Which results in this:

category description total_count
MISSING PERSON FOUND PERSON 10981
MISSING PERSON MISSING ADULT 7054
MISSING PERSON MISSING JUVENILE 4555

The "total_count" column ends up being exactly the same, so the calculations were the same. The only difference is in structure: your result has the "category" column and my expected result does not. Why do I prefer the latter?

Think about what is happening in the WHERE clause:

WHERE 
  category = 'MISSING PERSON'

By definition, all results are going to have the "category" of 'MISSING PERSON', i.e. every row will have 'MISSING PERSON' or else it wouldn't be included in the results. What's the purpose of doing a group count by "category" if we know there aren't different groups?

And since we aren't doing a GROUP BY that refers to the "category" column, best practice is to omit it from the SELECT clause too.

Query

Note how this query omits mention of the "category" column for purposes of displaying (via SELECT) or grouping (via GROUP BY):

SELECT 
  description,
  COUNT(*) AS total_count
FROM
  incidents
WHERE 
  category = 'MISSING PERSON'
GROUP BY 
  description
ORDER BY 
  total_count DESC;

5. List all descriptions of incidents categorized as either "MISSING PERSON" or "TRESPASS", in order of frequency

This is a situation in which we do want to include "category" as it can take on more than one value (two, in this case).

Expected results

Without the "category" column, we'd run into problems in which two "description" values had different categories and are meant to be treated differently. That isn't the case with the categories of "MISSING PERSON" and "TRESSPASS", but we have no idea the taxonomy used by the SFPD.

category description total_count
MISSING PERSON FOUND PERSON 10981
MISSING PERSON MISSING ADULT 7054
TRESPASS TRESPASSING 5430
MISSING PERSON MISSING JUVENILE 4555
TRESPASS LODGING WITHOUT PERMISSION 1288
TRESPASS LOITERING WITHOUT LAWFUL BUSINESS WITH OWNER OR OCCUPANT 30
TRESPASS TRESPASSING ON RAILROAD TRAINS 7
TRESPASS DISASTER AREA, ENTERING OR REMAINING IN 3

Query

SELECT 
  category,
  description,
  COUNT(*) AS total_count
FROM
  incidents
WHERE 
  category = 'MISSING PERSON'
  OR category = 'TRESPASS'
GROUP BY
  category, description
ORDER BY 
  total_count DESC;

6x. (EXAMPLE) List total incidents per year

As before, treat the "datetime" column as a string, and use the SUBSTR function to select the first 4 characters and effectively create a new column. Alias that column as "year" and sort by that alias.

Expected results

year total_count
2012 140847
2013 152806
2014 150128
2015 156530
2016 150736

Query

  SELECT 
    SUBSTR(datetime, 1, 4) AS year,
    COUNT(*) AS total_count
  FROM incidents
  GROUP BY year
  ORDER BY year ASC; 

6. List total incidents by month (ignoring year)

Sometimes it's interesting to count/group by a time cycle, such as by month (as opposed to year and month). This is the same as counting by year, except extracting a different part of the datetime string.

Expected results

month total_count
01 63726
02 58656
03 63582
04 61351
05 63459
06 60730
07 62828
08 65574
09 63561
10 66932
11 61149
12 59499

Hints

If you can't figure out the arguments for the SUBSTR function to excerpt that middle part of the datetime value, try it out iteratively:

SELECT
  datetime,
  SUBSTR(datetime, 1, 10)
FROM incidents
LIMIT 5;
datetime SUBSTR(datetime, 1, 10)
2012-01-01 00:01:00 2012-01-01
2012-01-01 00:01:00 2012-01-01
2012-01-01 00:01:00 2012-01-01
2012-01-01 00:01:00 2012-01-01
2012-01-01 00:01:00 2012-01-01

Keep trying until you get what you want:

SELECT
  datetime,
  SUBSTR(datetime, 2, 7)
FROM incidents
LIMIT 5;
datetime SUBSTR(datetime, 2, 7)
2012-01-01 00:01:00 012-01-
2012-01-01 00:01:00 012-01-
2012-01-01 00:01:00 012-01-
2012-01-01 00:01:00 012-01-
2012-01-01 00:01:00 012-01-

Query

  SELECT 
    SUBSTR(datetime, 6, 2) AS month,
    COUNT(*) AS total_count
  FROM incidents
  GROUP BY month
  ORDER BY month ASC; 

7. Yearly count of incidents categorized as "DRUG/NARCOTIC"

Expected result

year total_count
2012 6444
2013 6775
2014 5408
2015 4254
2016 4244

Query

  SELECT 
    SUBSTR(datetime, 1, 4) AS year,
    COUNT(*) AS total_count
  FROM incidents
  WHERE category = 'DRUG/NARCOTIC'
  GROUP BY year
  ORDER BY year ASC; 

8. Yearly count of incidents having anything to do with "marijuana"

Pretend that there are "marijuana"-related incidents that are not just in the category of "DRUG/NARCOTIC". Assume that "having anything to do with 'marijuana'" means that the word 'marijuana' appears somewhere in the record.

Read: Fuzzy matching in SQL, particularly Matching patterns with LIKE

Expected results

year total_count
2012 1395
2013 1371
2014 1191
2015 892
2016 740

Query

SELECT 
  SUBSTR(datetime, 1, 4) AS year,
  COUNT(*) AS total_count
FROM incidents
  WHERE description LIKE '%marijuana%'
GROUP BY year
ORDER BY year ASC;

9. Yearly count of "DRUG/NARCOTIC" incidents that do not mention marijuana

Answer is very similar to the previous one, but requires a WHERE condition to filter on the "category" and "description" columns.

Expected results

year total_count
2012 5049
2013 5404
2014 4217
2015 3362
2016 3504
{:.table-sql}

Query

SELECT 
  SUBSTR(datetime, 1, 4) AS year,
  COUNT(*) AS total_count
FROM incidents
  WHERE description NOT LIKE '%marijuana%'
  AND category = 'DRUG/NARCOTIC'
GROUP BY year
ORDER BY year ASC;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment