Instantly share code, notes, and snippets.

Created October 31, 2017 19:30
Show Gist options
• Save dannguyen/bda87523cb69e02865d457bedc07ebd0 to your computer and use it in GitHub Desktop.

# SQL practice with SFPD incidents and group by aggregation (3 of 4)

## Directions

This exercise set has 9 problems, not including 3 example problems (2x, 5x, and 6x)

Email me at dun@stanford.edu with the subject:

``````padjo-2017 homework **sql-ex-3** **your_sunet_id**
``````

With a Google Doc (attached or linked-to) in which each query has its own page.

The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:

## Introduction

Unlike the previous set of exercises, which practiced using the COUNT aggregation function to provide a single result value (i.e. the row count of the result set), these exercises use aggregation with the GROUP BY clause.

Instead of counting the number of results, we can get a count of results by group. Just like we can with pivot tables.

## Exercises

### 1. Group by the `resolution` field, list in alphabetical order

#### Expected result

resolution
ARREST, BOOKED
ARREST, CITED
COMPLAINANT REFUSES TO PROSECUTE
DISTRICT ATTORNEY REFUSES TO PROSECUTE
EXCEPTIONAL CLEARANCE
JUVENILE BOOKED
JUVENILE CITED
JUVENILE DIVERTED
LOCATED
NONE
NOT PROSECUTED
PROSECUTED BY OUTSIDE AGENCY
PROSECUTED FOR LESSER OFFENSE
PSYCHOPATHIC CASE
UNFOUNDED

#### Hints

This is pretty straightforward; requires a "GROUP BY" and "ORDER BY" clauses

#### Query

```SELECT
resolution
FROM
incidents
GROUP BY
resolution
ORDER BY
resolution ASC;```

### 2x. (EXAMPLE) Do a group count of the `resolution` field, list resolutions in alphabetical order

#### Expected results

resolution total_count
ARREST, BOOKED 174718
ARREST, CITED 34180
COMPLAINANT REFUSES TO PROSECUTE 4595
DISTRICT ATTORNEY REFUSES TO PROSECUTE 1055
EXCEPTIONAL CLEARANCE 2188
JUVENILE BOOKED 4781
JUVENILE CITED 1740
JUVENILE DIVERTED 190
LOCATED 10411
NONE 494882
NOT PROSECUTED 504
PROSECUTED BY OUTSIDE AGENCY 975
PROSECUTED FOR LESSER OFFENSE 5
PSYCHOPATHIC CASE 11563
UNFOUNDED 8317

#### Query

```SELECT
resolution,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
resolution
ORDER BY
resolution ASC;```

### 2. List the 5 most frequently-occurring values of the `resolution` field

Same as previous question, just a different sort order, and a LIMIT clause.

#### Expected results

resolution total_count
NONE 494882
ARREST, BOOKED 174718
ARREST, CITED 34180
PSYCHOPATHIC CASE 11563
LOCATED 10411
{:.table-sql}

#### Query

```SELECT
resolution,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
resolution
ORDER BY
total_count DESC
LIMIT 5;```

### 3. List the top 10 most frequently-occurring values of the `category` field

Pretty much the exact same thing as the previous question, with a few details changed. See if you can write this out from scratch:

#### Expected results

category total_count
LARCENY/THEFT 187882
OTHER OFFENSES 98907
NON-CRIMINAL 94489
ASSAULT 63874
VANDALISM 38162
VEHICLE THEFT 33896
WARRANTS 33163
BURGLARY 30241
SUSPICIOUS OCC 28062
DRUG/NARCOTIC 27125

#### Query

```SELECT
category,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
category
ORDER BY
total_count DESC
LIMIT 10;```

### 4. List the top 10 most frequently occurring `category` and `description` combinations

Virtually the same as the prior question, except we're grouping by, and showing, two columns (besides the "total_count" aggregate column).

#### Expected results

category description total_count
LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO 73239
NON-CRIMINAL LOST PROPERTY 23236
LARCENY/THEFT PETTY THEFT OF PROPERTY 22825
OTHER OFFENSES DRIVERS LICENSE, SUSPENDED OR REVOKED 22497
NON-CRIMINAL AIDED CASE, MENTAL DISTURBED 22228
VEHICLE THEFT STOLEN AUTOMOBILE 21383
ASSAULT BATTERY 19121
WARRANTS WARRANT ARREST 18235
LARCENY/THEFT PETTY THEFT FROM LOCKED AUTO 18046
VANDALISM MALICIOUS MISCHIEF, VANDALISM 16465

#### Query

```SELECT
category, description,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
category, description
ORDER BY
total_count DESC
LIMIT 10;```

### 5x. (EXAMPLE) List all descriptions of incidents categorized as "MISSING PERSON", in order of frequency

This is virtually the same as the prior exercise, but I make this an example to show something subtle about the results.

#### Expected results

This is what I want to see -- note that the "category" column is left out:

description total_count
FOUND PERSON 10981
MISSING JUVENILE 4555

#### Hints

If you piggyback from the answer in the previous exercise, you might come up with this:

```SELECT
category, description,
COUNT(*) AS total_count
FROM
incidents
WHERE
category = 'MISSING PERSON'
GROUP BY
category, description
ORDER BY
total_count DESC;```

Which results in this:

category description total_count
MISSING PERSON FOUND PERSON 10981
MISSING PERSON MISSING JUVENILE 4555

The "total_count" column ends up being exactly the same, so the calculations were the same. The only difference is in structure: your result has the "category" column and my expected result does not. Why do I prefer the latter?

Think about what is happening in the WHERE clause:

```WHERE
category = 'MISSING PERSON'```

By definition, all results are going to have the "category" of 'MISSING PERSON', i.e. every row will have 'MISSING PERSON' or else it wouldn't be included in the results. What's the purpose of doing a group count by "category" if we know there aren't different groups?

And since we aren't doing a GROUP BY that refers to the "category" column, best practice is to omit it from the SELECT clause too.

#### Query

Note how this query omits mention of the "category" column for purposes of displaying (via SELECT) or grouping (via GROUP BY):

```SELECT
description,
COUNT(*) AS total_count
FROM
incidents
WHERE
category = 'MISSING PERSON'
GROUP BY
description
ORDER BY
total_count DESC;```

### 5. List all descriptions of incidents categorized as either "MISSING PERSON" or "TRESPASS", in order of frequency

This is a situation in which we do want to include "category" as it can take on more than one value (two, in this case).

#### Expected results

Without the "category" column, we'd run into problems in which two "description" values had different categories and are meant to be treated differently. That isn't the case with the categories of "MISSING PERSON" and "TRESSPASS", but we have no idea the taxonomy used by the SFPD.

category description total_count
MISSING PERSON FOUND PERSON 10981
TRESPASS TRESPASSING 5430
MISSING PERSON MISSING JUVENILE 4555
TRESPASS LODGING WITHOUT PERMISSION 1288
TRESPASS LOITERING WITHOUT LAWFUL BUSINESS WITH OWNER OR OCCUPANT 30
TRESPASS TRESPASSING ON RAILROAD TRAINS 7
TRESPASS DISASTER AREA, ENTERING OR REMAINING IN 3

#### Query

```SELECT
category,
description,
COUNT(*) AS total_count
FROM
incidents
WHERE
category = 'MISSING PERSON'
OR category = 'TRESPASS'
GROUP BY
category, description
ORDER BY
total_count DESC;```

### 6x. (EXAMPLE) List total incidents per year

As before, treat the "datetime" column as a string, and use the SUBSTR function to select the first 4 characters and effectively create a new column. Alias that column as "year" and sort by that alias.

year total_count
2012 140847
2013 152806
2014 150128
2015 156530
2016 150736

#### Query

```  SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
GROUP BY year
ORDER BY year ASC; ```

### 6. List total incidents by month (ignoring year)

Sometimes it's interesting to count/group by a time cycle, such as by month (as opposed to year and month). This is the same as counting by year, except extracting a different part of the datetime string.

#### Expected results

month total_count
01 63726
02 58656
03 63582
04 61351
05 63459
06 60730
07 62828
08 65574
09 63561
10 66932
11 61149
12 59499

#### Hints

If you can't figure out the arguments for the SUBSTR function to excerpt that middle part of the datetime value, try it out iteratively:

```SELECT
datetime,
SUBSTR(datetime, 1, 10)
FROM incidents
LIMIT 5;```
datetime SUBSTR(datetime, 1, 10)
2012-01-01 00:01:00 2012-01-01
2012-01-01 00:01:00 2012-01-01
2012-01-01 00:01:00 2012-01-01
2012-01-01 00:01:00 2012-01-01
2012-01-01 00:01:00 2012-01-01

Keep trying until you get what you want:

```SELECT
datetime,
SUBSTR(datetime, 2, 7)
FROM incidents
LIMIT 5;```
datetime SUBSTR(datetime, 2, 7)
2012-01-01 00:01:00 012-01-
2012-01-01 00:01:00 012-01-
2012-01-01 00:01:00 012-01-
2012-01-01 00:01:00 012-01-
2012-01-01 00:01:00 012-01-

#### Query

```  SELECT
SUBSTR(datetime, 6, 2) AS month,
COUNT(*) AS total_count
FROM incidents
GROUP BY month
ORDER BY month ASC; ```

### 7. Yearly count of incidents categorized as "DRUG/NARCOTIC"

year total_count
2012 6444
2013 6775
2014 5408
2015 4254
2016 4244

#### Query

```  SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
WHERE category = 'DRUG/NARCOTIC'
GROUP BY year
ORDER BY year ASC; ```

### 8. Yearly count of incidents having anything to do with "marijuana"

Pretend that there are "marijuana"-related incidents that are not just in the category of "DRUG/NARCOTIC". Assume that "having anything to do with 'marijuana'" means that the word 'marijuana' appears somewhere in the record.

Read: Fuzzy matching in SQL, particularly Matching patterns with LIKE

year total_count
2012 1395
2013 1371
2014 1191
2015 892
2016 740

#### Query

```SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
WHERE description LIKE '%marijuana%'
GROUP BY year
ORDER BY year ASC;```

### 9. Yearly count of "DRUG/NARCOTIC" incidents that do not mention marijuana

Answer is very similar to the previous one, but requires a WHERE condition to filter on the "category" and "description" columns.

year total_count
2012 5049
2013 5404
2014 4217
2015 3362
2016 3504
{:.table-sql}

#### Query

```SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
WHERE description NOT LIKE '%marijuana%'
AND category = 'DRUG/NARCOTIC'
GROUP BY year
ORDER BY year ASC;```