dannguyen/answers-sql-sfpd-incidents-grouping.md

## answers-sql-sfpd-incidents-grouping.md

      
    Raw
  

              answers-sql-sfpd-incidents-grouping.md
            
          
    SQL practice with SFPD incidents and group by aggregation (3 of 4)

Directions

This exercise set has 9 problems, not including 3 example problems (2x, 5x, and 6x)
Email me at dun@stanford.edu with the subject:
padjo-2017 homework **sql-ex-3** **your_sunet_id**

With a Google Doc (attached or linked-to) in which each query has its own page.
The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:
http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip
Introduction

Unlike the previous set of exercises, which practiced using the COUNT aggregation function to provide a single result value (i.e. the row count of the result set), these exercises use aggregation with the GROUP BY clause.
Instead of counting the number of results, we can get a count of results by group. Just like we can with pivot tables.
Related reading


Grouping in SQL
Aggregate functions in SQL

Exercises

1. Group by the resolution field, list in alphabetical order

Expected result


resolution


ARREST, BOOKED


ARREST, CITED


CLEARED-CONTACT JUVENILE FOR MORE INFO


COMPLAINANT REFUSES TO PROSECUTE


DISTRICT ATTORNEY REFUSES TO PROSECUTE


EXCEPTIONAL CLEARANCE


JUVENILE ADMONISHED


JUVENILE BOOKED


JUVENILE CITED


JUVENILE DIVERTED


LOCATED


NONE


NOT PROSECUTED


PROSECUTED BY OUTSIDE AGENCY


PROSECUTED FOR LESSER OFFENSE


PSYCHOPATHIC CASE


UNFOUNDED


Hints

This is pretty straightforward; requires a "GROUP BY" and "ORDER BY" clauses
Query

SELECT 
  resolution
FROM
  incidents
GROUP BY 
  resolution
ORDER BY 
  resolution ASC;
2x. (EXAMPLE) Do a group count of the resolution field, list resolutions in alphabetical order

Expected results


resolution
total_count


ARREST, BOOKED
174718


ARREST, CITED
34180


CLEARED-CONTACT JUVENILE FOR MORE INFO
362


COMPLAINANT REFUSES TO PROSECUTE
4595


DISTRICT ATTORNEY REFUSES TO PROSECUTE
1055


EXCEPTIONAL CLEARANCE
2188


JUVENILE ADMONISHED
581


JUVENILE BOOKED
4781


JUVENILE CITED
1740


JUVENILE DIVERTED
190


LOCATED
10411


NONE
494882


NOT PROSECUTED
504


PROSECUTED BY OUTSIDE AGENCY
975


PROSECUTED FOR LESSER OFFENSE
5


PSYCHOPATHIC CASE
11563


UNFOUNDED
8317


Query

SELECT 
  resolution,
  COUNT(*) AS total_count
FROM
  incidents
GROUP BY 
  resolution
ORDER BY 
  resolution ASC;
2. List the 5 most frequently-occurring values of the resolution field

Same as previous question, just a different sort order, and a LIMIT clause.
Expected results


resolution
total_count


NONE
494882


ARREST, BOOKED
174718


ARREST, CITED
34180


PSYCHOPATHIC CASE
11563


LOCATED
10411


{:.table-sql}


Query

SELECT 
  resolution,
  COUNT(*) AS total_count
FROM
  incidents
GROUP BY 
  resolution
ORDER BY 
  total_count DESC
LIMIT 5;
3. List the top 10 most frequently-occurring values of the category field

Pretty much the exact same thing as the previous question, with a few details changed. See if you can write this out from scratch:
Expected results


category
total_count


LARCENY/THEFT
187882


OTHER OFFENSES
98907


NON-CRIMINAL
94489


ASSAULT
63874


VANDALISM
38162


VEHICLE THEFT
33896


WARRANTS
33163


BURGLARY
30241


SUSPICIOUS OCC
28062


DRUG/NARCOTIC
27125


Query

SELECT 
  category,
  COUNT(*) AS total_count
FROM
  incidents
GROUP BY 
  category
ORDER BY 
  total_count DESC
LIMIT 10;
4. List the top 10 most frequently occurring category and description combinations

Virtually the same as the prior question, except we're grouping by, and showing, two columns (besides the "total_count" aggregate column).
Expected results


category
description
total_count


LARCENY/THEFT
GRAND THEFT FROM LOCKED AUTO
73239


NON-CRIMINAL
LOST PROPERTY
23236


LARCENY/THEFT
PETTY THEFT OF PROPERTY
22825


OTHER OFFENSES
DRIVERS LICENSE, SUSPENDED OR REVOKED
22497


NON-CRIMINAL
AIDED CASE, MENTAL DISTURBED
22228


VEHICLE THEFT
STOLEN AUTOMOBILE
21383


ASSAULT
BATTERY
19121


WARRANTS
WARRANT ARREST
18235


LARCENY/THEFT
PETTY THEFT FROM LOCKED AUTO
18046


VANDALISM
MALICIOUS MISCHIEF, VANDALISM
16465


Query

SELECT 
  category, description,
  COUNT(*) AS total_count
FROM
  incidents
GROUP BY 
  category, description
ORDER BY 
  total_count DESC
LIMIT 10;
5x. (EXAMPLE) List all descriptions of incidents categorized as  "MISSING PERSON", in order of frequency

This is virtually the same as the prior exercise, but I make this an example to show something subtle about the results.
Expected results

This is what I want to see -- note that the "category" column is left out:


description
total_count


FOUND PERSON
10981


MISSING ADULT
7054


MISSING JUVENILE
4555


Hints

If you piggyback from the answer in the previous exercise, you might come up with this:
SELECT 
  category, description,
  COUNT(*) AS total_count
FROM
  incidents
WHERE 
  category = 'MISSING PERSON'
GROUP BY 
  category, description
ORDER BY 
  total_count DESC;
Which results in this:


category
description
total_count


MISSING PERSON
FOUND PERSON
10981


MISSING PERSON
MISSING ADULT
7054


MISSING PERSON
MISSING JUVENILE
4555


The "total_count" column ends up being exactly the same, so the calculations were the same. The only difference is in structure: your result has the "category" column and my expected result does not. Why do I prefer the latter?
Think about what is happening in the WHERE clause:
WHERE 
  category = 'MISSING PERSON'
By definition, all results are going to have the "category" of 'MISSING PERSON', i.e. every row will have 'MISSING PERSON' or else it wouldn't be included in the results. What's the purpose of doing a group count by "category" if we know there aren't different groups?
And since we aren't doing a GROUP BY that refers to the "category" column, best practice is to omit it from the SELECT clause too.
Query

Note how this query omits mention of the "category" column  for purposes of displaying (via SELECT) or grouping (via GROUP BY):
SELECT 
  description,
  COUNT(*) AS total_count
FROM
  incidents
WHERE 
  category = 'MISSING PERSON'
GROUP BY 
  description
ORDER BY 
  total_count DESC;
5.  List all descriptions of incidents categorized as either  "MISSING PERSON" or "TRESPASS", in order of frequency

This is a situation in which we do want to include "category" as it can take on more than one value (two, in this case).
Expected results

Without the  "category" column, we'd run into problems in which two "description" values had different categories and are meant to be treated differently. That isn't the case with the categories of "MISSING PERSON" and "TRESSPASS", but we have no idea the taxonomy used by the SFPD.


category
description
total_count


MISSING PERSON
FOUND PERSON
10981


MISSING PERSON
MISSING ADULT
7054


TRESPASS
TRESPASSING
5430


MISSING PERSON
MISSING JUVENILE
4555


TRESPASS
LODGING WITHOUT PERMISSION
1288


TRESPASS
LOITERING WITHOUT LAWFUL BUSINESS WITH OWNER OR OCCUPANT
30


TRESPASS
TRESPASSING ON RAILROAD TRAINS
7


TRESPASS
DISASTER AREA, ENTERING OR REMAINING IN
3


Query

SELECT 
  category,
  description,
  COUNT(*) AS total_count
FROM
  incidents
WHERE 
  category = 'MISSING PERSON'
  OR category = 'TRESPASS'
GROUP BY
  category, description
ORDER BY 
  total_count DESC;
6x. (EXAMPLE) List total incidents per year

As before, treat the "datetime" column as a string, and use the SUBSTR function to select the first 4 characters and effectively create a new column. Alias that column as "year" and sort by that alias.
Expected results


year
total_count


2012
140847


2013
152806


2014
150128


2015
156530


2016
150736


Query

  SELECT 
    SUBSTR(datetime, 1, 4) AS year,
    COUNT(*) AS total_count
  FROM incidents
  GROUP BY year
  ORDER BY year ASC; 
6. List total incidents by month (ignoring year)

Sometimes it's interesting to count/group by a time cycle, such as by month (as opposed to year and month). This is the same as counting by year, except extracting a different part of the datetime string.
Expected results


month
total_count


01
63726


02
58656


03
63582


04
61351


05
63459


06
60730


07
62828


08
65574


09
63561


10
66932


11
61149


12
59499


Hints

If you can't figure out the arguments for the SUBSTR function to excerpt that middle part of the datetime value, try it out iteratively:
SELECT
  datetime,
  SUBSTR(datetime, 1, 10)
FROM incidents
LIMIT 5;


datetime
SUBSTR(datetime, 1, 10)


2012-01-01 00:01:00
2012-01-01


2012-01-01 00:01:00
2012-01-01


2012-01-01 00:01:00
2012-01-01


2012-01-01 00:01:00
2012-01-01


2012-01-01 00:01:00
2012-01-01


Keep trying until you get what you want:
SELECT
  datetime,
  SUBSTR(datetime, 2, 7)
FROM incidents
LIMIT 5;


datetime
SUBSTR(datetime, 2, 7)


2012-01-01 00:01:00
012-01-


2012-01-01 00:01:00
012-01-


2012-01-01 00:01:00
012-01-


2012-01-01 00:01:00
012-01-


2012-01-01 00:01:00
012-01-


Query

  SELECT 
    SUBSTR(datetime, 6, 2) AS month,
    COUNT(*) AS total_count
  FROM incidents
  GROUP BY month
  ORDER BY month ASC; 
7. Yearly count of incidents categorized as "DRUG/NARCOTIC"

Expected result


year
total_count


2012
6444


2013
6775


2014
5408


2015
4254


2016
4244


Query

  SELECT 
    SUBSTR(datetime, 1, 4) AS year,
    COUNT(*) AS total_count
  FROM incidents
  WHERE category = 'DRUG/NARCOTIC'
  GROUP BY year
  ORDER BY year ASC; 
8. Yearly count of incidents having anything to do with "marijuana"

Pretend that there are "marijuana"-related incidents that are not just in the category of "DRUG/NARCOTIC". Assume that "having anything to do with 'marijuana'" means that the word 'marijuana' appears somewhere in the record.
Read: Fuzzy matching in SQL, particularly Matching patterns with LIKE
Expected results


year
total_count


2012
1395


2013
1371


2014
1191


2015
892


2016
740


Query

SELECT 
  SUBSTR(datetime, 1, 4) AS year,
  COUNT(*) AS total_count
FROM incidents
  WHERE description LIKE '%marijuana%'
GROUP BY year
ORDER BY year ASC;
9. Yearly count of "DRUG/NARCOTIC" incidents that do not mention marijuana

Answer is very similar to the previous one, but requires a WHERE condition to filter on the "category" and "description" columns.
Expected results


year
total_count


2012
5049


2013
5404


2014
4217


2015
3362


2016
3504


{:.table-sql}


Query

SELECT 
  SUBSTR(datetime, 1, 4) AS year,
  COUNT(*) AS total_count
FROM incidents
  WHERE description NOT LIKE '%marijuana%'
  AND category = 'DRUG/NARCOTIC'
GROUP BY year
ORDER BY year ASC;
resolution	total_count
ARREST, BOOKED	174718
ARREST, CITED	34180
CLEARED-CONTACT JUVENILE FOR MORE INFO	362
COMPLAINANT REFUSES TO PROSECUTE	4595
DISTRICT ATTORNEY REFUSES TO PROSECUTE	1055
EXCEPTIONAL CLEARANCE	2188
JUVENILE ADMONISHED	581
JUVENILE BOOKED	4781
JUVENILE CITED	1740
JUVENILE DIVERTED	190
LOCATED	10411
NONE	494882
NOT PROSECUTED	504
PROSECUTED BY OUTSIDE AGENCY	975
PROSECUTED FOR LESSER OFFENSE	5
PSYCHOPATHIC CASE	11563
UNFOUNDED	8317
category	total_count
LARCENY/THEFT	187882
OTHER OFFENSES	98907
NON-CRIMINAL	94489
ASSAULT	63874
VANDALISM	38162
VEHICLE THEFT	33896
WARRANTS	33163
BURGLARY	30241
SUSPICIOUS OCC	28062
DRUG/NARCOTIC	27125
category	description	total_count
LARCENY/THEFT	GRAND THEFT FROM LOCKED AUTO	73239
NON-CRIMINAL	LOST PROPERTY	23236
LARCENY/THEFT	PETTY THEFT OF PROPERTY	22825
OTHER OFFENSES	DRIVERS LICENSE, SUSPENDED OR REVOKED	22497
NON-CRIMINAL	AIDED CASE, MENTAL DISTURBED	22228
VEHICLE THEFT	STOLEN AUTOMOBILE	21383
ASSAULT	BATTERY	19121
WARRANTS	WARRANT ARREST	18235
LARCENY/THEFT	PETTY THEFT FROM LOCKED AUTO	18046
VANDALISM	MALICIOUS MISCHIEF, VANDALISM	16465
category	description	total_count
MISSING PERSON	FOUND PERSON	10981
MISSING PERSON	MISSING ADULT	7054
MISSING PERSON	MISSING JUVENILE	4555
month	total_count
01	63726
02	58656
03	63582
04	61351
05	63459
06	60730
07	62828
08	65574
09	63561
10	66932
11	61149
12	59499
datetime	SUBSTR(datetime, 1, 10)
2012-01-01 00:01:00	2012-01-01
2012-01-01 00:01:00	2012-01-01
2012-01-01 00:01:00	2012-01-01
2012-01-01 00:01:00	2012-01-01
2012-01-01 00:01:00	2012-01-01
datetime	SUBSTR(datetime, 2, 7)
2012-01-01 00:01:00	012-01-
2012-01-01 00:01:00	012-01-
2012-01-01 00:01:00	012-01-
2012-01-01 00:01:00	012-01-
2012-01-01 00:01:00	012-01-