Skip to content

Instantly share code, notes, and snippets.

@dannguyen

dannguyen/00.md

Last active Oct 26, 2017
Embed
What would you like to do?
This page contains all of the SQL SFPD exercises for PADJO 2017, just so they're all on 1 page.

THE TOP NAV

(fyi, the class homepage: http://2017.padjo.org//assignments/homework/sql-practice-1/sql-exercises-sfpd)

This is a quick nav to the 4 exercise sections -- it's easy to get lost if you try to scroll through them

Due: Wednesday, October 24, 2017, 11:59 PM

Due: Friday, October 26, 2017, 11:59 PM

Basic instructions

Each exercise section has a set of exercises with a (perhaps too vague) prompt, like:

Total number of incidents categorized as "ASSAULT"

And the expected results from a SQL query:

total_assaults
63874

Your job is to: write the SQL query that answers the prompt and gets ther results.

In this case:

SELECT 
  COUNT(*) AS total_assaults
FROM 
  incidents
WHERE 
  category = 'ASSAULT';

For each exercise section, you'll email me a Word/Google doc in which each page has its own query.

Btw, the above example was the actual answer for Set 2, Exercise 2, but please write it out for yourself and make sure it works before just handing it in.

SQL practice with SFPD Incidents; Select/Sort/Filter (1 of 4)

Answers: https://gist.github.com/dannguyen/d06f8fb41722dd777c70acf83b1e03ad

Introduction

This is an overview lesson that covers all of the syntax needed to find and sort records in a database before getting into the GROUP BY clause.

Think of these exercises as covering database basics before getting into the aggregation functionality of GROUP BY. It's just like wanting to cover the many features of spreadsheets before getting into pivot tables and aggregation.

Directions

This exercise set has 6 problems, not including two example problems (i.e. 1x and 4x).

Email me at dun@stanford.edu with the subject:

padjo-2017 homework sql-ex-1 your_sunet_id

With a Google Doc (attached or linked-to) in which each query has its own page.

The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:

http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip

Related reading

From the 2014 syllabus (more complete, maybe not as concise):

From the 2015 syllabus -- more broken down than 2014, but might have some incomplete spots:

It is an overwhelming number of concepts and syntax, but mastery isn't expected, just familiarity for now.


Exercises

1x. (EXAMPLE) Show just the datetime column from the very first incident in the database

"Very first incident" in the database basically means, whatever the first row is in the data table without an additional sorting of our own. Use a LIMIT clause to make sure the results only include a single row.

Related:

Expected result

datetime
2012-01-01 00:01:00

Hints

If this is your first SQL of the day, you can try these queries first:

  • Select 'hello'

    SELECT 'hello';
    
  • Select 'hello' for every row that is in "incidents"

    SELECT 'hello'
    FROM incidents;
    
  • Select all the columns for all the rows in "incidents" (may be a lot for computer to process)

    SELECT *
    FROM incidents;
    
  • Select all the columns for the first 5 rows

    SELECT *
    FROM incidents;
    LIMIT 5;
    

Query

SELECT datetime
FROM incidents
LIMIT 1;

1. Find selected columns from the very first incident in the database

Many of the columns in this database aren't needed for the exercises. So write a query that pulls only these fields, in this order:

  • incident_number
  • datetime
  • category
  • description
  • resolution
  • pd_district

Expected result

incident_number datetime category description resolution pd_district
130,261,396 2012-01-01 00:01:00 OTHER OFFENSES FALSE PERSONATION BAYVIEW

Query

SELECT 
  incident_number,
  datetime,
  category,
  description,
  resolution,
  pd_district
FROM incidents
LIMIT 1;

2. Find the very last incident (chronologically) in the database

Related:

Expected results

incident_number datetime category description resolution pd_district
170000198 2016-12-31 23:58:00 DRIVING UNDER THE INFLUENCE DRIVING WHILE UNDER THE INFLUENCE OF DRUGS ARREST, BOOKED MISSION
{:.table-sql}

Query

SELECT 
  incident_number,
  datetime,
  category,
  description,
  resolution,
  pd_district
FROM 
  incidents
ORDER BY 
  datetime DESC
LIMIT 1;

3. Find the earliest five crimes (chronologically) categorized as "VEHICLE THEFT"

Related:

Expected results

incident_number datetime category description resolution pd_district
120001613 2012-01-01 02:30:00 VEHICLE THEFT STOLEN AUTOMOBILE INGLESIDE
120002257 2012-01-01 03:00:00 VEHICLE THEFT STOLEN AUTOMOBILE SOUTHERN
120135751 2012-01-01 08:00:00 VEHICLE THEFT STOLEN MOTORCYCLE INGLESIDE
120003368 2012-01-01 09:00:00 VEHICLE THEFT STOLEN AUTOMOBILE INGLESIDE
120498349 2012-01-01 12:00:00 VEHICLE THEFT STOLEN MOTORCYCLE INGLESIDE

Query

SELECT 
  incident_number,
  datetime,
  category,
  description,
  resolution,
  pd_district
FROM 
  incidents
WHERE 
  category = 'VEHICLE THEFT'
ORDER BY 
  datetime ASC
LIMIT 5;

4x. (EXAMPLE) Find the latest incident that happened on the first day of 2012

Related:

Expected results

incident_number datetime category description resolution pd_district
120002837 2012-01-01 23:05:00 VANDALISM MALICIOUS MISCHIEF, VANDALISM OF VEHICLES INGLESIDE
{:.table-sql}

Hints

The key to this exercise is to realize that the "datetime" field is a text field, as far as SQLite is concerned. Which means we can use text functions to extract dates and do comparisons.

Try:

SELECT datetime
FROM incidents
LIMIT 5;

Then:

SELECT datetime,
  SUBSTR(datetime, 1, 4)
FROM incidents
LIMIT 5;
datetime SUBSTR(datetime, 1, 4)
2012-01-01 00:01:00 2012
2012-01-01 00:01:00 2012
2012-01-01 00:01:00 2012
2012-01-01 00:01:00 2012
2012-01-01 00:01:00 2012

And whatever you'd like:

SELECT datetime,
  SUBSTR(datetime, 1, 4),
  SUBSTR(datetime, 1, 7)
FROM incidents
LIMIT 5;
datetime SUBSTR(datetime, 1, 4) SUBSTR(datetime, 1, 7)
2012-01-01 00:01:00 2012 2012-01
2012-01-01 00:01:00 2012 2012-01
2012-01-01 00:01:00 2012 2012-01
2012-01-01 00:01:00 2012 2012-01
2012-01-01 00:01:00 2012 2012-01
{:.table-sql}

What might be hard to figure out is how to set the condition to limit records in which the "datetime" value (which is not just a date but a time) is "equal" to the first day of January 2012.

This won't work:

SELECT 
  datetime
FROM incidents
WHERE datetime = '2012-01-01'
LIMIT 1;

Though using comparisons might work:

SELECT 
  datetime
FROM incidents
WHERE datetime > '2012-01-01' 
  AND datetime < '2012-01-02'
LIMIT 1;

There's all kind of variety:

SELECT 
  datetime
FROM incidents
WHERE datetime 
  BETWEEN '2012-01-01' AND '2012-01-02'
LIMIT 1;

Query

SELECT 
  incident_number,
  datetime,
  category,
  description,
  resolution,
  pd_district
FROM 
  incidents
WHERE 
  SUBSTR(datetime, 1, 10) = '2012-01-01'
ORDER BY 
  datetime DESC
LIMIT 1;

4. Find the earliest 5 incidents of the year 2014

Expected results

incident_number datetime category description resolution pd_district
140592131 2014-01-01 00:01:00 SEX OFFENSES, FORCIBLE SODOMY ARREST, BOOKED MISSION
140882346 2014-01-01 00:01:00 FRAUD FRAUDULENT USE OF AUTOMATED TELLER CARD NORTHERN
140309936 2014-01-01 00:01:00 OTHER OFFENSES FALSE PERSONATION TO RECEIVE MONEY OR PROPERTY UNFOUNDED RICHMOND
140471381 2014-01-01 00:01:00 FRAUD CREDIT CARD, THEFT BY USE OF CENTRAL
140057743 2014-01-01 00:01:00 SUSPICIOUS OCC SUSPICIOUS OCCURRENCE NORTHERN

5. Find the earliest 5 chronological vehicle thefts that had a resolution other than "NONE"

Expected results

incident_number datetime category description resolution pd_district
120005342 2012-01-03 06:42:00 VEHICLE THEFT ATTEMPTED STOLEN VEHICLE ARREST, BOOKED MISSION
120011355 2012-01-04 21:00:00 VEHICLE THEFT STOLEN AUTOMOBILE UNFOUNDED SOUTHERN
120016656 2012-01-05 16:00:00 VEHICLE THEFT STOLEN AUTOMOBILE ARREST, BOOKED RICHMOND
120017604 2012-01-06 10:30:00 VEHICLE THEFT STOLEN TRUCK ARREST, BOOKED INGLESIDE
120017165 2012-01-06 21:45:00 VEHICLE THEFT STOLEN AUTOMOBILE UNFOUNDED INGLESIDE

6. Find the last 5 incidents of the year 2014 that had something to do with cocaine

Related: Using LIKE, IN, BETWEEN, and wildcards to match multiple values in SQL

Expected Results

incident_number datetime category description resolution pd_district
141098374 2014-12-30 18:55:00 DRUG/NARCOTIC POSSESSION OF BASE/ROCK COCAINE FOR SALE ARREST, BOOKED TENDERLOIN
141096942 2014-12-30 11:30:00 DRUG/NARCOTIC POSSESSION OF BASE/ROCK COCAINE FOR SALE ARREST, BOOKED SOUTHERN
141094366 2014-12-29 15:10:00 DRUG/NARCOTIC POSSESSION OF BASE/ROCK COCAINE TENDERLOIN
141094366 2014-12-29 15:10:00 DRUG/NARCOTIC SALE OF BASE/ROCK COCAINE TENDERLOIN
141092928 2014-12-29 03:15:00 DRUG/NARCOTIC POSSESSION OF COCAINE INGLESIDE

7. Find the last 5 (chronological) "DRUG/NARCOTIC" crimes of the year 2014 that was not specifically about cocaine

Assume that "not specifically about cocaine" means that the word 'cocaine' did not show up in the description.

Expected Results

incident_number datetime category description resolution pd_district
141101610 2014-12-31 21:51:00 DRUG/NARCOTIC POSSESSION OF METH-AMPHETAMINE ARREST, BOOKED RICHMOND
141101444 2014-12-31 20:37:00 DRUG/NARCOTIC POSSESSION OF MARIJUANA ARREST, BOOKED MISSION
141099429 2014-12-31 08:08:00 DRUG/NARCOTIC POSSESSION OF CONTROLLED SUBSTANCE ARREST, BOOKED SOUTHERN
141099429 2014-12-31 08:08:00 DRUG/NARCOTIC POSSESSION OF NARCOTICS PARAPHERNALIA ARREST, BOOKED SOUTHERN
141099350 2014-12-31 03:16:00 DRUG/NARCOTIC POSSESSION OF MARIJUANA TARAVAL

SQL practice with SFPD incidents and simple aggregation (2 of 4)

Directions

This exercise set has 5 problems, not including two example problems (i.e. 1x and 5x).

Email me at dun@stanford.edu with the subject:

padjo-2017 homework sql-ex-2 your_sunet_id

With a Google Doc (attached or linked-to) in which each query has its own page.

The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:

http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip

Introduction

The SFPD data doesn't lend itself to much quantification, as the latitude and longitude columns are the only numerical columns, thus sum, average, max/min aren't very interesting aggregate methods.

So for this exercise, we just care about the COUNT function. When COUNT, or any aggregate function, is used without a GROUP BY clause, the result is a single value.

Pretend you had a spreadsheet and you used the =COUNTA() or =SUM() function across every row, i.e. =COUNTA(A1:A9999 and =SUM(A1:A9999). You would expect the function to deliver a single value.

The spreadsheet equivalent of "GROUP BY" is to do things using pivot tables. But for these SQL exercises, it's a singular count (i.e. not a pivot table, i.e. not using GROUP BY).

Besides the COUNT() function in the SELECT clause, all the other SQL syntax remains the same.

Related reading

Check out this past lesson, Aggregate functions in SQL. Specifically, focus on the The COUNT section and especially, the Counting without grouping section. (Ignore the GROUP BY stuff for now)

Exercises

1x. (EXAMPLE) Just do a count

The syntax for calculating the count of rows in a result set is simply:

SELECT COUNT(*); 

Believe it or not, even without reference to the database table (i.e. incidents), this has a valid result:

COUNT(*)
1

This can be interpreted as something like, The number of rows in the result for our useless queries is: 1.

1. Total number of incidents

Using SELECT COUNT(*) with a FROM clause will count the total number of rows in the table.

Expected result

total_count
751047

Hints

The expected answer expects a column name of total_count, so your query should include the proper alias clause.

2. Total number of incidents categorized as "ASSAULT"

Same as previous question, except there is a WHERE clause and a different alias.

Expected result

total_assaults
63874

Query

(I'm just giving you this answer)

SELECT 
  COUNT(*) AS total_assaults
FROM 
  incidents
WHERE 
  category = 'ASSAULT';

3. Total number of incidents having something to do with marijuana

Assume "having something to do with marijuana" means that the literal text 'marijuana' shows up in the "description" field, i.e. you will want to use the LIKE and wildcards.

Read: Fuzzy matching in SQL, particularly Matching patterns with LIKE

Expected result

total_count
5589

4. Total number of incidents taking place in 2014

Expected result

total_count
150128

5x. (EXAMPLE) Total number of incidents that took place in the "TENDERLOIN" district that were either "UNFOUNDED" or "NOT PROSECUTED"

Nothing different from the previous exercises, just a slightly more complicated logic in the WHERE clause.

Expected result

total_count
504

Hints

It is very easy to mix up the logic here. For example, this is NOT the logic that we want for finding both kinds of "resolution":

  resolution = 'UNFOUNDED'
  AND resolution = 'NOT PROSECUTED'

That is mutually exclusive (do the logic in your head). We want to use OR:

  resolution = 'UNFOUNDED'
  OR resolution = 'NOT PROSECUTED'

Now look at the "pd_district" = 'TENDERLOIN' logic separately:

pd_district = 'TENDERLOIN'

How do we combine the two conditional clauses?

Without parentheses

(the exerpt below is obviously just an excerpt, not valid syntax)

  
  pd_district = 'TENDERLOIN'
  AND resolution = 'UNFOUNDED'
  OR resolution = 'NOT PROSECUTED'

The above logic returns results in which:

  1. The district is "TENDERLOIN" and the resolution is "UNFOUNDED"
  2. The resolution is "NOT PROSECUTED"

The results of #2 include rows from districts that are not the TENDERLOIND

Using parentheses to group logic

Using parentheses, as we do in arithmetic, we can enforce the order of the logic:

  
  pd_district = 'TENDERLOIN'
  AND
  (resolution = 'UNFOUNDED'
   OR resolution = 'NOT PROSECUTED')

Unlike the previous bad example, there are only 2 separate conditions that must both be met. The latter condition, dealing with the resolution, can have multiple choices. But in the end, all results are in the 'TENDERLOIN' district.

Query

SELECT 
  COUNT(*) AS total_count
FROM 
  incidents
WHERE 
    pd_district = 'TENDERLOIN'
    AND (resolution = 'UNFOUNDED' 
     OR resolution = 'NOT PROSECUTED');

5. Total number of incidents taking place in 2016 in either the "TENDERLOIN" or "MISSION" districts

Expected results

total_count
29488

SQL practice with SFPD incidents and group by aggregation (3 of 4)

Directions

This exercise set has 9 problems, not including 3 example problems (2x, 5x, and 6x)

Email me at dun@stanford.edu with the subject:

padjo-2017 homework sql-ex-3 your_sunet_id

With a Google Doc (attached or linked-to) in which each query has its own page.

The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:

http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip

Introduction

Unlike the previous set of exercises, which practiced using the COUNT aggregation function to provide a single result value (i.e. the row count of the result set), these exercises use aggregation with the GROUP BY clause.

Instead of counting the number of results, we can get a count of results by group. Just like we can with pivot tables.

Related reading

Exercises

1. Group by the resolution field, list in alphabetical order

Expected result

resolution
ARREST, BOOKED
ARREST, CITED
CLEARED-CONTACT JUVENILE FOR MORE INFO
COMPLAINANT REFUSES TO PROSECUTE
DISTRICT ATTORNEY REFUSES TO PROSECUTE
EXCEPTIONAL CLEARANCE
JUVENILE ADMONISHED
JUVENILE BOOKED
JUVENILE CITED
JUVENILE DIVERTED
LOCATED
NONE
NOT PROSECUTED
PROSECUTED BY OUTSIDE AGENCY
PROSECUTED FOR LESSER OFFENSE
PSYCHOPATHIC CASE
UNFOUNDED

Hints

This is pretty straightforward; requires a "GROUP BY" and "ORDER BY" clauses

2x. (EXAMPLE) Do a group count of the resolution field, list resolutions in alphabetical order

Expected results

resolution total_count
ARREST, BOOKED 174718
ARREST, CITED 34180
CLEARED-CONTACT JUVENILE FOR MORE INFO 362
COMPLAINANT REFUSES TO PROSECUTE 4595
DISTRICT ATTORNEY REFUSES TO PROSECUTE 1055
EXCEPTIONAL CLEARANCE 2188
JUVENILE ADMONISHED 581
JUVENILE BOOKED 4781
JUVENILE CITED 1740
JUVENILE DIVERTED 190
LOCATED 10411
NONE 494882
NOT PROSECUTED 504
PROSECUTED BY OUTSIDE AGENCY 975
PROSECUTED FOR LESSER OFFENSE 5
PSYCHOPATHIC CASE 11563
UNFOUNDED 8317

Query

SELECT 
  resolution,
  COUNT(*) AS total_count
FROM
  incidents
GROUP BY 
  resolution
ORDER BY 
  resolution ASC;

2. List the 5 most frequently-occurring values of the resolution field

Same as previous question, just a different sort order, and a LIMIT clause.

Expected results

resolution total_count
NONE 494882
ARREST, BOOKED 174718
ARREST, CITED 34180
PSYCHOPATHIC CASE 11563
LOCATED 10411
{:.table-sql}

3. List the top 10 most frequently-occurring values of the category field

Pretty much the exact same thing as the previous question, with a few details changed. See if you can write this out from scratch:

Expected results

category total_count
LARCENY/THEFT 187882
OTHER OFFENSES 98907
NON-CRIMINAL 94489
ASSAULT 63874
VANDALISM 38162
VEHICLE THEFT 33896
WARRANTS 33163
BURGLARY 30241
SUSPICIOUS OCC 28062
DRUG/NARCOTIC 27125

4. List the top 10 most frequently occurring category and description combinations

Virtually the same as the prior question, except we're grouping by, and showing, two columns (besides the "total_count" aggregate column).

Expected results

category description total_count
LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO 73239
NON-CRIMINAL LOST PROPERTY 23236
LARCENY/THEFT PETTY THEFT OF PROPERTY 22825
OTHER OFFENSES DRIVERS LICENSE, SUSPENDED OR REVOKED 22497
NON-CRIMINAL AIDED CASE, MENTAL DISTURBED 22228
VEHICLE THEFT STOLEN AUTOMOBILE 21383
ASSAULT BATTERY 19121
WARRANTS WARRANT ARREST 18235
LARCENY/THEFT PETTY THEFT FROM LOCKED AUTO 18046
VANDALISM MALICIOUS MISCHIEF, VANDALISM 16465

5x. (EXAMPLE) List all descriptions of incidents categorized as "MISSING PERSON", in order of frequency

This is virtually the same as the prior exercise, but I make this an example to show something subtle about the results.

Expected results

This is what I want to see -- note that the "category" column is left out:

description total_count
FOUND PERSON 10981
MISSING ADULT 7054
MISSING JUVENILE 4555

Hints

If you piggyback from the answer in the previous exercise, you might come up with this:

SELECT 
  category, description,
  COUNT(*) AS total_count
FROM
  incidents
WHERE 
  category = 'MISSING PERSON'
GROUP BY 
  category, description
ORDER BY 
  total_count DESC;

Which results in this:

category description total_count
MISSING PERSON FOUND PERSON 10981
MISSING PERSON MISSING ADULT 7054
MISSING PERSON MISSING JUVENILE 4555

The "total_count" column ends up being exactly the same, so the calculations were the same. The only difference is in structure: your result has the "category" column and my expected result does not. Why do I prefer the latter?

Think about what is happening in the WHERE clause:

WHERE 
  category = 'MISSING PERSON'

By definition, all results are going to have the "category" of 'MISSING PERSON', i.e. every row will have 'MISSING PERSON' or else it wouldn't be included in the results. What's the purpose of doing a group count by "category" if we know there aren't different groups?

And since we aren't doing a GROUP BY that refers to the "category" column, best practice is to omit it from the SELECT clause too.

Query

Note how this query omits mention of the "category" column for purposes of displaying (via SELECT) or grouping (via GROUP BY):

SELECT 
  description,
  COUNT(*) AS total_count
FROM
  incidents
WHERE 
  category = 'MISSING PERSON'
GROUP BY 
  description
ORDER BY 
  total_count DESC;

5. List all descriptions of incidents categorized as either "MISSING PERSON" or "TRESPASS", in order of frequency

This is a situation in which we do want to include "category" as it can take on more than one value (two, in this case).

Expected results

Without the "category" column, we'd run into problems in which two "description" values had different categories and are meant to be treated differently. That isn't the case with the categories of "MISSING PERSON" and "TRESSPASS", but we have no idea the taxonomy used by the SFPD.

category description total_count
MISSING PERSON FOUND PERSON 10981
MISSING PERSON MISSING ADULT 7054
TRESPASS TRESPASSING 5430
MISSING PERSON MISSING JUVENILE 4555
TRESPASS LODGING WITHOUT PERMISSION 1288
TRESPASS LOITERING WITHOUT LAWFUL BUSINESS WITH OWNER OR OCCUPANT 30
TRESPASS TRESPASSING ON RAILROAD TRAINS 7
TRESPASS DISASTER AREA, ENTERING OR REMAINING IN 3

6x. (EXAMPLE) List total incidents per year

As before, treat the "datetime" column as a string, and use the SUBSTR function to select the first 4 characters and effectively create a new column. Alias that column as "year" and sort by that alias.

Expected results

year total_count
2012 140847
2013 152806
2014 150128
2015 156530
2016 150736

Query

  SELECT 
    SUBSTR(datetime, 1, 4) AS year,
    COUNT(*) AS total_count
  FROM incidents
  GROUP BY year
  ORDER BY year ASC; 

6. List total incidents by month (ignoring year)

Sometimes it's interesting to count/group by a time cycle, such as by month (as opposed to year and month). This is the same as counting by year, except extracting a different part of the datetime string.

Expected results

month total_count
01 63726
02 58656
03 63582
04 61351
05 63459
06 60730
07 62828
08 65574
09 63561
10 66932
11 61149
12 59499

Hints

If you can't figure out the arguments for the SUBSTR function to excerpt that middle part of the datetime value, try it out iteratively:

SELECT
  datetime,
  SUBSTR(datetime, 1, 10)
FROM incidents
LIMIT 5;
datetime SUBSTR(datetime, 1, 10)
2012-01-01 00:01:00 2012-01-01
2012-01-01 00:01:00 2012-01-01
2012-01-01 00:01:00 2012-01-01
2012-01-01 00:01:00 2012-01-01
2012-01-01 00:01:00 2012-01-01

Keep trying until you get what you want:

SELECT
  datetime,
  SUBSTR(datetime, 2, 7)
FROM incidents
LIMIT 5;
datetime SUBSTR(datetime, 2, 7)
2012-01-01 00:01:00 012-01-
2012-01-01 00:01:00 012-01-
2012-01-01 00:01:00 012-01-
2012-01-01 00:01:00 012-01-
2012-01-01 00:01:00 012-01-

7. Yearly count of incidents categorized as "DRUG/NARCOTIC"

Expected result

year total_count
2012 6444
2013 6775
2014 5408
2015 4254
2016 4244

8. Yearly count of incidents having anything to do with "marijuana"

Pretend that there are "marijuana"-related incidents that are not just in the category of "DRUG/NARCOTIC". Assume that "having anything to do with 'marijuana'" means that the word 'marijuana' appears somewhere in the record.

Read: Fuzzy matching in SQL, particularly Matching patterns with LIKE

Expected results

year total_count
2012 1395
2013 1371
2014 1191
2015 892
2016 740

9. Yearly count of "DRUG/NARCOTIC" incidents that do not mention marijuana

Answer is very similar to the previous one, but requires a WHERE condition to filter on the "category" and "description" columns.

Expected results

year total_count
2012 5049
2013 5404
2014 4217
2015 3362
2016 3504
{:.table-sql}

SQL practice with SFPD incidents and the HAVING clause (4 of 4)

Directions

This exercise set has 3 problems, not including 1 example problems, 1x.

Email me at dun@stanford.edu with the subject:

padjo-2017 homework sql-ex-4 your_sunet_id

With a Google Doc (attached or linked-to) in which each query has its own page.

The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:

http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip

Introduction

Related: Stack Overflow: Using group by and having clause

(warning: tortured analogy)

Understanding the HAVING clause doesn't require much more than understanding WHERE. However, the reason why HAVING exists is due to the internals (i.e. something we treat as a black box) of how SQL processes our query.

Imagine a table full of assorted fruit, bananas, oranges, kiwis, just all over the place.

If we wanted just the fruit that was bananas, we'd say something like:

Gather up the fruit that are bananas,

SELECT * 
FROM fruits 
WHERE type = 'banana';

What if we wanted to collect just the yellowish-fruits which as a group numbered at least 10 or more, i.e.

I want just the yellow fruits of which there are more than 10 of each on this table

In SQL, that query would look something like this:

SELECT type, 
  COUNT(*) AS quantity
FROM fruits
WHERE color = 'yellow'
GROUP BY type
HAVING quantity > 10;

In SQL, that quantity > 10 must be in the HAVING clause. Because the filtering conditions in the WHERE clause are done before the group-by-aggregation. The column "quantity" simply does not exist before the GROUP BY is done.

To go back to our table full of fruit analogy: it's possible to gather/discard fruit based on the color of 'yellow' before we start grouping the fruit into groups by type, i.e. 'lemon', 'bananas', etc. However, it is only that we've grouped the fruit together that we can count each type's quantity

1x. (EXAMPLE) List the categories in which the incident count is greater than 50,000

Expected results

category total_count
LARCENY/THEFT 187882
OTHER OFFENSES 98907
NON-CRIMINAL 94489
ASSAULT 63874

Start off by doing a group count by grouping by the "category" column and doing a reverse sort of the count column (aliased as "total_count"):

SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY category
ORDER BY total_count DESC;

For the sake of brevity, here are the first 10 or so rows:

category total_count
LARCENY/THEFT 187882
OTHER OFFENSES 98907
NON-CRIMINAL 94489
ASSAULT 63874
VANDALISM 38162
VEHICLE THEFT 33896
WARRANTS 33163
BURGLARY 30241
SUSPICIOUS OCC 28062
DRUG/NARCOTIC 27125
MISSING PERSON 22590

Hints

The point of the HAVING clause is to filter on the aggregated column, i.e. the "total_count" column. The aggregate happens because of the GROUP BY clause, so HAVING always comes after GROUP BY. Otherwise, it's not much different than WHERE

Query

SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY category
HAVING total_count > 50000
ORDER BY total_count DESC;
Further discussion

A common error is to try to use WHERE when only HAVING will do:

SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
WHERE total_count > 50000
GROUP BY category
ORDER BY total_count DESC;

You should get this error:

misuse of aggregate: COUNT()

Likewise, HAVING is easily confused with WHERE. The following query does a silly filter for rows in which "category" is equal to 'BURGLARY', i.e. all the results will be burglaries:

SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY category
HAVING category = 'BURGLARY'
ORDER BY total_count DESC;

However, this doesn't throw an error because HAVING works pretty much the same as WHERE:

category total_count
BURGLARY 30241

This is equally nonsensical without throwing any errors:

SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
WHERE category = 'BURGLARY'
GROUP BY category
HAVING category = 'BURGLARY'
ORDER BY total_count DESC;

It still works. But it is nonsensical in a redundant way, like saying:

I want to buy the cat with a hat that is a cat

To reduce confusion, best practice in terms of SQL style is to only do filtering on aggregate columns in the HAVING clause.

1. List in alphabetical order the descriptions that have more than 20,000 incidents

Expected results

description total_count
AIDED CASE, MENTAL DISTURBED 22228
DRIVERS LICENSE, SUSPENDED OR REVOKED 22497
GRAND THEFT FROM LOCKED AUTO 73239
LOST PROPERTY 23236
PETTY THEFT OF PROPERTY 22825
STOLEN AUTOMOBILE 21383

2. For the year 2015, list the precincts that had more than 20,000 incidents, in order of most incidents

Expected results

pd_district total_count
SOUTHERN 30095
NORTHERN 20089

3. List the top 5 descriptions that have more than 1,000 incidents for the years 2014 through 2016, sorted in order of shortest to longest description

This one is a bit of a trick question. Not much different from the previous questions, but uses a LENGTH function (to determine the length of a string) as a way to sort the results.

If you can get this, then you have a decent sense of how WHERE and GROUP BY are related but different...

Expected results

description total_count
BATTERY 11786
AIDED CASE 3185
CONSPIRACY 1410
TRESPASSING 3497
CASE CLOSURE 2272

Hints

In case you need a reminder on boolean expressions:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment