Skip to content

Instantly share code, notes, and snippets.

@dannguyen
Created October 31, 2017 19:37
Show Gist options
  • Save dannguyen/10a698d57f0b82d9c60423160d9f1837 to your computer and use it in GitHub Desktop.
Save dannguyen/10a698d57f0b82d9c60423160d9f1837 to your computer and use it in GitHub Desktop.

SQL practice with SFPD incidents and the HAVING clause (4 of 4)

Directions

This exercise set has 3 problems, not including 1 example problems, 1x.

Email me at dun@stanford.edu with the subject:

padjo-2017 homework **sql-ex-4** **your_sunet_id**

With a Google Doc (attached or linked-to) in which each query has its own page.

The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:

http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip

Introduction

Related: Stack Overflow: Using group by and having clause

(warning: tortured analogy)

Understanding the HAVING clause doesn't require much more than understanding WHERE. However, the reason why HAVING exists is due to the internals (i.e. something we treat as a black box) of how SQL processes our query.

Imagine a table full of assorted fruit, bananas, oranges, kiwis, just all over the place.

If we wanted just the fruit that was bananas, we'd say something like:

Gather up the fruit that are bananas,

SELECT * 
FROM fruits 
WHERE type = 'banana';

What if we wanted to collect just the yellowish-fruits which as a group numbered at least 10 or more, i.e.

I want just the yellow fruits of which there are more than 10 of each on this table

In SQL, that query would look something like this:

SELECT type, 
  COUNT(*) AS quantity
FROM fruits
WHERE color = 'yellow'
GROUP BY type
HAVING quantity > 10;

In SQL, that quantity > 10 must be in the HAVING clause. Because the filtering conditions in the WHERE clause are done before the group-by-aggregation. The column "quantity" simply does not exist before the GROUP BY is done.

To go back to our table full of fruit analogy: it's possible to gather/discard fruit based on the color of 'yellow' before we start grouping the fruit into groups by type, i.e. 'lemon', 'bananas', etc. However, it is only that we've grouped the fruit together that we can count each type's quantity

1x. (EXAMPLE) List the categories in which the incident count is greater than 50,000

Expected results

category total_count
LARCENY/THEFT 187882
OTHER OFFENSES 98907
NON-CRIMINAL 94489
ASSAULT 63874

Start off by doing a group count by grouping by the "category" column and doing a reverse sort of the count column (aliased as "total_count"):

SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY category
ORDER BY total_count DESC;

For the sake of brevity, here are the first 10 or so rows:

category total_count
LARCENY/THEFT 187882
OTHER OFFENSES 98907
NON-CRIMINAL 94489
ASSAULT 63874
VANDALISM 38162
VEHICLE THEFT 33896
WARRANTS 33163
BURGLARY 30241
SUSPICIOUS OCC 28062
DRUG/NARCOTIC 27125
MISSING PERSON 22590

Hints

The point of the HAVING clause is to filter on the aggregated column, i.e. the "total_count" column. The aggregate happens because of the GROUP BY clause, so HAVING always comes after GROUP BY. Otherwise, it's not much different than WHERE

Query

SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY category
HAVING total_count > 50000
ORDER BY total_count DESC;
Further discussion

A common error is to try to use WHERE when only HAVING will do:

SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
WHERE total_count > 50000
GROUP BY category
ORDER BY total_count DESC;

You should get this error:

misuse of aggregate: COUNT()

Likewise, HAVING is easily confused with WHERE. The following query does a silly filter for rows in which "category" is equal to 'BURGLARY', i.e. all the results will be burglaries:

SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY category
HAVING category = 'BURGLARY'
ORDER BY total_count DESC;

However, this doesn't throw an error because HAVING works pretty much the same as WHERE:

category total_count
BURGLARY 30241

This is equally nonsensical without throwing any errors:

SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
WHERE category = 'BURGLARY'
GROUP BY category
HAVING category = 'BURGLARY'
ORDER BY total_count DESC;

It still works. But it is nonsensical in a redundant way, like saying:

I want to buy the cat with a hat that is a cat

To reduce confusion, best practice in terms of SQL style is to only do filtering on aggregate columns in the HAVING clause.

1. List in alphabetical order the descriptions that have more than 20,000 incidents

Expected results

description total_count
AIDED CASE, MENTAL DISTURBED 22228
DRIVERS LICENSE, SUSPENDED OR REVOKED 22497
GRAND THEFT FROM LOCKED AUTO 73239
LOST PROPERTY 23236
PETTY THEFT OF PROPERTY 22825
STOLEN AUTOMOBILE 21383

Query

SELECT 
  description,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY description
HAVING total_count > 20000
ORDER BY description ASC;

2. For the year 2015, list the precincts that had more than 20,000 incidents, in order of most incidents

Expected results

pd_district total_count
SOUTHERN 30095
NORTHERN 20089

Query

SELECT 
  pd_district,
  COUNT(*) AS total_count
FROM 
  incidents
WHERE 
  SUBSTR(datetime, 1, 4) = '2015'
GROUP BY pd_district
HAVING total_count > 20000
ORDER BY total_count DESC;

3. List the top 5 descriptions that have more than 1,000 incidents for the years 2014 through 2016, sorted in order of shortest to longest description

This one is a bit of a trick question. Not much different from the previous questions, but uses a LENGTH function (to determine the length of a string) as a way to sort the results.

If you can get this, then you have a decent sense of how WHERE and GROUP BY are related but different...

Expected results

description total_count
BATTERY 11786
AIDED CASE 3185
CONSPIRACY 1410
TRESPASSING 3497
CASE CLOSURE 2272
FOUND PERSON 6613
STOLEN TRUCK 4595
LOST PROPERTY 13874
MISSING ADULT 4407
FOUND PROPERTY 9190

Hints

In case you need a reminder on boolean expressions:

Query

SELECT 
  description,
  COUNT(*) AS total_count
FROM 
  incidents
WHERE 
  SUBSTR(datetime, 1, 4) IN ('2014', '2015', '2016')
GROUP BY description
HAVING total_count > 1000
ORDER BY 
  LENGTH(description) ASC
LIMIT 10;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment