dannguyen/answers-sql-sfpd-incidents-having.md

## answers-sql-sfpd-incidents-having.md

      
    Raw
  

              answers-sql-sfpd-incidents-having.md
            
          
    SQL practice with SFPD incidents and the HAVING clause (4 of 4)

Directions

This exercise set has 3 problems, not including 1 example problems, 1x.
Email me at dun@stanford.edu with the subject:
padjo-2017 homework **sql-ex-4** **your_sunet_id**

With a Google Doc (attached or linked-to) in which each query has its own page.
The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:
http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip
Introduction

Related: Stack Overflow: Using group by and having clause
(warning: tortured analogy)
Understanding the HAVING clause doesn't require much more than understanding WHERE. However, the reason why HAVING exists is due to the internals (i.e. something we treat as a black box) of how SQL processes our query.
Imagine a table full of assorted fruit, bananas, oranges, kiwis, just all over the place.
If we wanted just the fruit that was bananas, we'd say something like:
Gather up the fruit that are bananas,
SELECT * 
FROM fruits 
WHERE type = 'banana';
What if we wanted to collect just the yellowish-fruits which as a group numbered at least 10 or more, i.e.
I want just the yellow fruits of which there are more than 10 of each on this table
In SQL, that query would look something like this:
SELECT type, 
  COUNT(*) AS quantity
FROM fruits
WHERE color = 'yellow'
GROUP BY type
HAVING quantity > 10;
In SQL, that quantity > 10 must be in the HAVING clause. Because the filtering conditions in the WHERE clause are done before the group-by-aggregation. The column "quantity" simply does not exist before the GROUP BY is done.
To go back to our table full of fruit analogy: it's possible to gather/discard fruit based on the color of 'yellow' before we start grouping the fruit into groups by type, i.e. 'lemon', 'bananas', etc. However, it is only that we've grouped the fruit together that we can count each type's quantity
1x. (EXAMPLE) List the categories in which the incident count is greater than 50,000

Expected results


category
total_count


LARCENY/THEFT
187882


OTHER OFFENSES
98907


NON-CRIMINAL
94489


ASSAULT
63874


Start off by doing a group count by grouping by the "category" column and doing a reverse sort of the count column (aliased as "total_count"):
SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY category
ORDER BY total_count DESC;
For the sake of brevity, here are the first 10 or so rows:


category
total_count


LARCENY/THEFT
187882


OTHER OFFENSES
98907


NON-CRIMINAL
94489


ASSAULT
63874


VANDALISM
38162


VEHICLE THEFT
33896


WARRANTS
33163


BURGLARY
30241


SUSPICIOUS OCC
28062


DRUG/NARCOTIC
27125


MISSING PERSON
22590


Hints

The point of the HAVING clause is to filter on the aggregated column, i.e. the "total_count" column. The aggregate happens because of the GROUP BY clause, so HAVING always comes after GROUP BY. Otherwise, it's not much different than WHERE
Query

SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY category
HAVING total_count > 50000
ORDER BY total_count DESC;
Further discussion

A common error is to try to use WHERE when only HAVING will do:
SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
WHERE total_count > 50000
GROUP BY category
ORDER BY total_count DESC;
You should get this error:
misuse of aggregate: COUNT()

Likewise, HAVING is easily confused with WHERE. The following query does a silly filter for rows in which "category" is equal to 'BURGLARY', i.e. all the results will be burglaries:
SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY category
HAVING category = 'BURGLARY'
ORDER BY total_count DESC;
However, this doesn't throw an error because HAVING works pretty much the same as WHERE:


category
total_count


BURGLARY
30241


This is equally nonsensical without throwing any errors:
SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
WHERE category = 'BURGLARY'
GROUP BY category
HAVING category = 'BURGLARY'
ORDER BY total_count DESC;
It still works. But it is nonsensical in a redundant way, like saying:

I want to buy the cat with a hat that is a cat

To reduce confusion, best practice in terms of SQL style is to only do filtering on aggregate columns in the HAVING clause.
1. List in alphabetical order the descriptions that have more than 20,000 incidents

Expected results


description
total_count


AIDED CASE, MENTAL DISTURBED
22228


DRIVERS LICENSE, SUSPENDED OR REVOKED
22497


GRAND THEFT FROM LOCKED AUTO
73239


LOST PROPERTY
23236


PETTY THEFT OF PROPERTY
22825


STOLEN AUTOMOBILE
21383


Query

SELECT 
  description,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY description
HAVING total_count > 20000
ORDER BY description ASC;
2. For the year 2015, list the precincts that had more than 20,000 incidents, in order of most incidents

Expected results


pd_district
total_count


SOUTHERN
30095


NORTHERN
20089


Query

SELECT 
  pd_district,
  COUNT(*) AS total_count
FROM 
  incidents
WHERE 
  SUBSTR(datetime, 1, 4) = '2015'
GROUP BY pd_district
HAVING total_count > 20000
ORDER BY total_count DESC;
3. List the top 5 descriptions that have more than 1,000 incidents for the years 2014 through 2016, sorted in order of shortest to longest description

This one is a bit of a trick question. Not much different from the previous questions, but uses a LENGTH function (to determine the length of a string) as a way to sort the results.
If you can get this, then you have a decent sense of how WHERE and GROUP BY are related but different...
Expected results


description
total_count


BATTERY
11786


AIDED CASE
3185


CONSPIRACY
1410


TRESPASSING
3497


CASE CLOSURE
2272


FOUND PERSON
6613


STOLEN TRUCK
4595


LOST PROPERTY
13874


MISSING ADULT
4407


FOUND PROPERTY
9190


Hints

In case you need a reminder on boolean expressions:

Conditional filtering with WHERE
Fuzzy matching in SQL
Using the WHERE clause to filter data in SQL
More Boolean Expressions to Filter SQL Queries
Using LIKE, IN, BETWEEN, and wildcards to match multiple values in SQL

Query

SELECT 
  description,
  COUNT(*) AS total_count
FROM 
  incidents
WHERE 
  SUBSTR(datetime, 1, 4) IN ('2014', '2015', '2016')
GROUP BY description
HAVING total_count > 1000
ORDER BY 
  LENGTH(description) ASC
LIMIT 10;
category	total_count
LARCENY/THEFT	187882
OTHER OFFENSES	98907
NON-CRIMINAL	94489
ASSAULT	63874
description	total_count
AIDED CASE, MENTAL DISTURBED	22228
DRIVERS LICENSE, SUSPENDED OR REVOKED	22497
GRAND THEFT FROM LOCKED AUTO	73239
LOST PROPERTY	23236
PETTY THEFT OF PROPERTY	22825
STOLEN AUTOMOBILE	21383
description	total_count
BATTERY	11786
AIDED CASE	3185
CONSPIRACY	1410
TRESPASSING	3497
CASE CLOSURE	2272
FOUND PERSON	6613
STOLEN TRUCK	4595
LOST PROPERTY	13874
MISSING ADULT	4407
FOUND PROPERTY	9190