dannguyen/00.md

## 00.md

      
    Raw
  

              00.md
            
          
    THE TOP NAV

(fyi, the class homepage: http://2017.padjo.org//assignments/homework/sql-practice-1/sql-exercises-sfpd)
This is a quick nav to the 4 exercise sections -- it's easy to get lost if you try to scroll through them
Due: Wednesday, October 24, 2017, 11:59 PM

SQL SFPD exercises 1 of 4
SQL SFPD exercises 2 of 4

Due: Friday, October 26, 2017, 11:59 PM

SQL SFPD exercises 3 of 4
SQL SFPD exercises 4 of 4

Basic instructions

Each exercise section has a set of exercises with a (perhaps too vague) prompt, like:
Total number of incidents categorized as "ASSAULT"
And the expected results from a SQL query:


total_assaults


63874


Your job is to: write the SQL query that answers the prompt and gets ther results.
In this case:
SELECT 
  COUNT(*) AS total_assaults
FROM 
  incidents
WHERE 
  category = 'ASSAULT';
For each exercise section, you'll email me a Word/Google doc in which each page has its own query.
Btw, the above example was the actual answer for Set 2, Exercise 2, but please write it out for yourself and make sure it works before just handing it in.

  
## 01-select-sort-filter.md

      
    Raw
  

              01-select-sort-filter.md
            
          
    SQL practice with SFPD Incidents; Select/Sort/Filter (1 of 4)

Answers: https://gist.github.com/dannguyen/d06f8fb41722dd777c70acf83b1e03ad
Introduction

This is an overview lesson that covers all of the syntax needed to find and sort records in a database before getting into the  GROUP BY clause.
Think of these exercises as covering database basics before getting into the aggregation functionality of GROUP BY. It's just like wanting to cover the many features of spreadsheets before getting into pivot tables and aggregation.
Directions

This exercise set has 6 problems, not including two example problems (i.e. 1x and 4x).
Email me at dun@stanford.edu with the subject:
padjo-2017 homework sql-ex-1 your_sunet_id

With a Google Doc (attached or linked-to) in which each query has its own page.
The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:
http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip
Related reading

From the 2014 syllabus (more complete, maybe not as concise):

The SELECT statement and FROM clause
Limiting and ordering SQL results
Conditional filtering with WHERE
Fuzzy matching in SQL

From the 2015 syllabus -- more broken down than 2014, but might have some incomplete spots:

Getting started with SQLite Browser
SELECTing rows FROM data tables
LIMIT and ORDER BY in SQL queries
USING the WHERE clause to filter data in SQL
More Boolean Expressions to Filter SQL Queries
Using LIKE, IN, BETWEEN, and wildcards to match multiple values in SQL
Functions for transforming text and numbers in SQL
Aliasing Columns and Tables in SQL Databases

It is an overwhelming number of concepts and syntax, but mastery isn't expected, just familiarity for now.

Exercises

1x. (EXAMPLE) Show just the datetime column from the very first incident in the database

"Very first incident" in the database basically means, whatever the first row is in the data table without an additional sorting of our own. Use a LIMIT clause to make sure the results only include a single row.
Related:

SELECTing rows FROM data tables
LIMIT and ORDER BY in SQL queries

Expected result


datetime


2012-01-01 00:01:00


Hints

If this is your first SQL of the day, you can try these queries first:


Select 'hello'
SELECT 'hello';


Select 'hello' for every row that is in "incidents"
SELECT 'hello'
FROM incidents;


Select all the columns for all the rows in "incidents" (may be a lot for computer to process)
SELECT *
FROM incidents;


Select all the columns for the first 5 rows
SELECT *
FROM incidents;
LIMIT 5;


Query

SELECT datetime
FROM incidents
LIMIT 1;
1. Find selected columns from the very first incident in the database

Many of the columns in this database aren't needed for the exercises. So write a query that pulls only these fields, in this order:

incident_number
datetime
category
description
resolution
pd_district

Expected result


incident_number
datetime
category
description
resolution
pd_district


130,261,396
2012-01-01 00:01:00
OTHER OFFENSES
FALSE PERSONATION

BAYVIEW


Query

SELECT 
  incident_number,
  datetime,
  category,
  description,
  resolution,
  pd_district
FROM incidents
LIMIT 1;
2. Find the very last incident (chronologically) in the database

Related:

LIMIT and ORDER BY in SQL queries

Expected results


incident_number
datetime
category
description
resolution
pd_district


170000198
2016-12-31 23:58:00
DRIVING UNDER THE INFLUENCE
DRIVING WHILE UNDER THE INFLUENCE OF DRUGS
ARREST, BOOKED
MISSION


{:.table-sql}


Query

SELECT 
  incident_number,
  datetime,
  category,
  description,
  resolution,
  pd_district
FROM 
  incidents
ORDER BY 
  datetime DESC
LIMIT 1;
3.  Find the earliest five crimes (chronologically) categorized as "VEHICLE THEFT"

Related:

Conditional filtering with WHERE
USING the WHERE clause to filter data in SQL

Expected results


incident_number
datetime
category
description
resolution
pd_district


120001613
2012-01-01 02:30:00
VEHICLE THEFT
STOLEN AUTOMOBILE

INGLESIDE


120002257
2012-01-01 03:00:00
VEHICLE THEFT
STOLEN AUTOMOBILE

SOUTHERN


120135751
2012-01-01 08:00:00
VEHICLE THEFT
STOLEN MOTORCYCLE

INGLESIDE


120003368
2012-01-01 09:00:00
VEHICLE THEFT
STOLEN AUTOMOBILE

INGLESIDE


120498349
2012-01-01 12:00:00
VEHICLE THEFT
STOLEN MOTORCYCLE

INGLESIDE


Query

SELECT 
  incident_number,
  datetime,
  category,
  description,
  resolution,
  pd_district
FROM 
  incidents
WHERE 
  category = 'VEHICLE THEFT'
ORDER BY 
  datetime ASC
LIMIT 5;
4x. (EXAMPLE) Find the latest incident that happened on the first day of 2012

Related:

Conditional filtering with WHERE
USING the WHERE clause to filter data in SQL
Using LIKE, IN, BETWEEN, and wildcards to match multiple values in SQL

Expected results


incident_number
datetime
category
description
resolution
pd_district


120002837
2012-01-01 23:05:00
VANDALISM
MALICIOUS MISCHIEF, VANDALISM OF VEHICLES

INGLESIDE


{:.table-sql}


Hints

The key to this exercise is to realize that the "datetime" field is a text field, as far as SQLite is concerned. Which means we can use text functions to extract dates and do comparisons.
Try:
SELECT datetime
FROM incidents
LIMIT 5;
Then:
SELECT datetime,
  SUBSTR(datetime, 1, 4)
FROM incidents
LIMIT 5;


datetime
SUBSTR(datetime, 1, 4)


2012-01-01 00:01:00
2012


2012-01-01 00:01:00
2012


2012-01-01 00:01:00
2012


2012-01-01 00:01:00
2012


2012-01-01 00:01:00
2012


And whatever you'd like:
SELECT datetime,
  SUBSTR(datetime, 1, 4),
  SUBSTR(datetime, 1, 7)
FROM incidents
LIMIT 5;


datetime
SUBSTR(datetime, 1, 4)
SUBSTR(datetime, 1, 7)


2012-01-01 00:01:00
2012
2012-01


2012-01-01 00:01:00
2012
2012-01


2012-01-01 00:01:00
2012
2012-01


2012-01-01 00:01:00
2012
2012-01


2012-01-01 00:01:00
2012
2012-01


{:.table-sql}


What might be hard to figure out is how to set the condition to limit records in which the "datetime" value (which is not just a date but a time) is "equal" to the first day of January 2012.
This won't work:
SELECT 
  datetime
FROM incidents
WHERE datetime = '2012-01-01'
LIMIT 1;
Though using comparisons might work:
SELECT 
  datetime
FROM incidents
WHERE datetime > '2012-01-01' 
  AND datetime < '2012-01-02'
LIMIT 1;
There's all kind of variety:
SELECT 
  datetime
FROM incidents
WHERE datetime 
  BETWEEN '2012-01-01' AND '2012-01-02'
LIMIT 1;
Query

SELECT 
  incident_number,
  datetime,
  category,
  description,
  resolution,
  pd_district
FROM 
  incidents
WHERE 
  SUBSTR(datetime, 1, 10) = '2012-01-01'
ORDER BY 
  datetime DESC
LIMIT 1;
4. Find the earliest 5 incidents of the year 2014

Expected results


incident_number
datetime
category
description
resolution
pd_district


140592131
2014-01-01 00:01:00
SEX OFFENSES, FORCIBLE
SODOMY
ARREST, BOOKED
MISSION


140882346
2014-01-01 00:01:00
FRAUD
FRAUDULENT USE OF AUTOMATED TELLER CARD

NORTHERN


140309936
2014-01-01 00:01:00
OTHER OFFENSES
FALSE PERSONATION TO RECEIVE MONEY OR PROPERTY
UNFOUNDED
RICHMOND


140471381
2014-01-01 00:01:00
FRAUD
CREDIT CARD, THEFT BY USE OF

CENTRAL


140057743
2014-01-01 00:01:00
SUSPICIOUS OCC
SUSPICIOUS OCCURRENCE

NORTHERN


5. Find the earliest 5 chronological vehicle thefts that had a resolution other than "NONE"

Expected results


incident_number
datetime
category
description
resolution
pd_district


120005342
2012-01-03 06:42:00
VEHICLE THEFT
ATTEMPTED STOLEN VEHICLE
ARREST, BOOKED
MISSION


120011355
2012-01-04 21:00:00
VEHICLE THEFT
STOLEN AUTOMOBILE
UNFOUNDED
SOUTHERN


120016656
2012-01-05 16:00:00
VEHICLE THEFT
STOLEN AUTOMOBILE
ARREST, BOOKED
RICHMOND


120017604
2012-01-06 10:30:00
VEHICLE THEFT
STOLEN TRUCK
ARREST, BOOKED
INGLESIDE


120017165
2012-01-06 21:45:00
VEHICLE THEFT
STOLEN AUTOMOBILE
UNFOUNDED
INGLESIDE


6. Find the last 5 incidents of the year 2014 that had something to do with cocaine

Related: Using LIKE, IN, BETWEEN, and wildcards to match multiple values in SQL
Expected Results


incident_number
datetime
category
description
resolution
pd_district


141098374
2014-12-30 18:55:00
DRUG/NARCOTIC
POSSESSION OF BASE/ROCK COCAINE FOR SALE
ARREST, BOOKED
TENDERLOIN


141096942
2014-12-30 11:30:00
DRUG/NARCOTIC
POSSESSION OF BASE/ROCK COCAINE FOR SALE
ARREST, BOOKED
SOUTHERN


141094366
2014-12-29 15:10:00
DRUG/NARCOTIC
POSSESSION OF BASE/ROCK COCAINE

TENDERLOIN


141094366
2014-12-29 15:10:00
DRUG/NARCOTIC
SALE OF BASE/ROCK COCAINE

TENDERLOIN


141092928
2014-12-29 03:15:00
DRUG/NARCOTIC
POSSESSION OF COCAINE

INGLESIDE


7.  Find the last 5 (chronological) "DRUG/NARCOTIC" crimes of the year 2014 that was not specifically about cocaine

Assume that "not specifically about cocaine" means that the word 'cocaine' did not show up in the description.
Expected Results


incident_number
datetime
category
description
resolution
pd_district


141101610
2014-12-31 21:51:00
DRUG/NARCOTIC
POSSESSION OF METH-AMPHETAMINE
ARREST, BOOKED
RICHMOND


141101444
2014-12-31 20:37:00
DRUG/NARCOTIC
POSSESSION OF MARIJUANA
ARREST, BOOKED
MISSION


141099429
2014-12-31 08:08:00
DRUG/NARCOTIC
POSSESSION OF CONTROLLED SUBSTANCE
ARREST, BOOKED
SOUTHERN


141099429
2014-12-31 08:08:00
DRUG/NARCOTIC
POSSESSION OF NARCOTICS PARAPHERNALIA
ARREST, BOOKED
SOUTHERN


141099350
2014-12-31 03:16:00
DRUG/NARCOTIC
POSSESSION OF MARIJUANA

TARAVAL


## 02-counting.md

      
    Raw
  

              02-counting.md
            
          
    SQL practice with SFPD incidents and simple aggregation (2 of 4)

Directions

This exercise set has 5 problems, not including two example problems (i.e. 1x and 5x).
Email me at dun@stanford.edu with the subject:
padjo-2017 homework sql-ex-2 your_sunet_id

With a Google Doc (attached or linked-to) in which each query has its own page.
The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:
http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip
Introduction

The SFPD data doesn't lend itself to much quantification, as the latitude and longitude columns are the only numerical columns, thus sum, average, max/min aren't very interesting aggregate methods.
So for this exercise, we just care about the COUNT function. When COUNT, or any aggregate function, is used without a GROUP BY clause, the result is a single value.
Pretend you had a spreadsheet and you used the =COUNTA() or =SUM() function across every row, i.e. =COUNTA(A1:A9999 and =SUM(A1:A9999). You would expect the function to deliver a single value.
The spreadsheet equivalent of "GROUP BY" is to do things using pivot tables. But for these SQL exercises, it's a singular count (i.e. not a pivot table, i.e. not using GROUP BY).
Besides the COUNT() function in the SELECT clause, all the other SQL syntax remains the same.
Related reading

Check out this past lesson, Aggregate functions in SQL. Specifically, focus on the The COUNT section and especially, the Counting without grouping section.  (Ignore the GROUP BY stuff for now)
Exercises

1x. (EXAMPLE) Just do a count

The syntax for calculating the count of rows in a result set is simply:
SELECT COUNT(*); 
Believe it or not, even without reference to the database table (i.e. incidents), this has a valid result:


COUNT(*)


1


This can be interpreted as something like, The number of rows in the result for our useless queries is: 1.
1. Total number of incidents

Using SELECT COUNT(*) with a FROM clause will count the total number of rows in the table.
Expected result


total_count


751047


Hints

The expected answer expects a column name of total_count, so your query should include the proper alias clause.
2. Total number of incidents categorized as "ASSAULT"

Same as previous question, except there is a WHERE clause and a different alias.
Expected result


total_assaults


63874


Query

(I'm just giving you this answer)
SELECT 
  COUNT(*) AS total_assaults
FROM 
  incidents
WHERE 
  category = 'ASSAULT';
3. Total number of incidents having something to do with marijuana

Assume "having something to do with marijuana" means that the literal text 'marijuana' shows up in the "description" field, i.e. you will want to use the LIKE and wildcards.
Read: Fuzzy matching in SQL, particularly Matching patterns with LIKE
Expected result


total_count


5589


4. Total number of incidents taking place in 2014

Expected result


total_count


150128


5x. (EXAMPLE) Total number of incidents that took place in the "TENDERLOIN" district that were either "UNFOUNDED" or "NOT PROSECUTED"

Nothing different from the previous exercises, just a slightly more complicated logic in the WHERE clause.
Expected result


total_count


504


Hints

It is very easy to mix up the logic here. For example, this is NOT the logic that we want for finding both kinds of "resolution":
  resolution = 'UNFOUNDED'
  AND resolution = 'NOT PROSECUTED'
That is mutually exclusive (do the logic in your head). We want to use OR:
  resolution = 'UNFOUNDED'
  OR resolution = 'NOT PROSECUTED'
Now look at the "pd_district" = 'TENDERLOIN' logic separately:
pd_district = 'TENDERLOIN'
How do we combine the two conditional clauses?
Without parentheses

(the exerpt below is obviously just an excerpt, not valid syntax)
  
  pd_district = 'TENDERLOIN'
  AND resolution = 'UNFOUNDED'
  OR resolution = 'NOT PROSECUTED'
The above logic returns results in which:

The district is "TENDERLOIN" and the resolution is "UNFOUNDED"
The resolution is "NOT PROSECUTED"

The results of #2 include rows from districts that are not the TENDERLOIND
Using parentheses to group logic

Using parentheses, as we do in arithmetic, we can enforce the order of the logic:
  
  pd_district = 'TENDERLOIN'
  AND
  (resolution = 'UNFOUNDED'
   OR resolution = 'NOT PROSECUTED')
Unlike the previous bad example, there are only 2 separate conditions that must both be met. The latter condition, dealing with the resolution, can have multiple choices. But in the end, all results are in the 'TENDERLOIN' district.
Query

SELECT 
  COUNT(*) AS total_count
FROM 
  incidents
WHERE 
    pd_district = 'TENDERLOIN'
    AND (resolution = 'UNFOUNDED' 
     OR resolution = 'NOT PROSECUTED');
5. Total number of incidents taking place in 2016 in either the "TENDERLOIN" or "MISSION" districts

Expected results


total_count


29488


## 03-grouping.md

      
    Raw
  

              03-grouping.md
            
          
    SQL practice with SFPD incidents and group by aggregation (3 of 4)

Directions

This exercise set has 9 problems, not including 3 example problems (2x, 5x, and 6x)
Email me at dun@stanford.edu with the subject:
padjo-2017 homework sql-ex-3 your_sunet_id

With a Google Doc (attached or linked-to) in which each query has its own page.
The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:
http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip
Introduction

Unlike the previous set of exercises, which practiced using the COUNT aggregation function to provide a single result value (i.e. the row count of the result set), these exercises use aggregation with the GROUP BY clause.
Instead of counting the number of results, we can get a count of results by group. Just like we can with pivot tables.
Related reading


Grouping in SQL
Aggregate functions in SQL

Exercises

1. Group by the resolution field, list in alphabetical order

Expected result


resolution


ARREST, BOOKED


ARREST, CITED


CLEARED-CONTACT JUVENILE FOR MORE INFO


COMPLAINANT REFUSES TO PROSECUTE


DISTRICT ATTORNEY REFUSES TO PROSECUTE


EXCEPTIONAL CLEARANCE


JUVENILE ADMONISHED


JUVENILE BOOKED


JUVENILE CITED


JUVENILE DIVERTED


LOCATED


NONE


NOT PROSECUTED


PROSECUTED BY OUTSIDE AGENCY


PROSECUTED FOR LESSER OFFENSE


PSYCHOPATHIC CASE


UNFOUNDED


Hints

This is pretty straightforward; requires a "GROUP BY" and "ORDER BY" clauses
2x. (EXAMPLE) Do a group count of the resolution field, list resolutions in alphabetical order

Expected results


resolution
total_count


ARREST, BOOKED
174718


ARREST, CITED
34180


CLEARED-CONTACT JUVENILE FOR MORE INFO
362


COMPLAINANT REFUSES TO PROSECUTE
4595


DISTRICT ATTORNEY REFUSES TO PROSECUTE
1055


EXCEPTIONAL CLEARANCE
2188


JUVENILE ADMONISHED
581


JUVENILE BOOKED
4781


JUVENILE CITED
1740


JUVENILE DIVERTED
190


LOCATED
10411


NONE
494882


NOT PROSECUTED
504


PROSECUTED BY OUTSIDE AGENCY
975


PROSECUTED FOR LESSER OFFENSE
5


PSYCHOPATHIC CASE
11563


UNFOUNDED
8317


Query

SELECT 
  resolution,
  COUNT(*) AS total_count
FROM
  incidents
GROUP BY 
  resolution
ORDER BY 
  resolution ASC;
2. List the 5 most frequently-occurring values of the resolution field

Same as previous question, just a different sort order, and a LIMIT clause.
Expected results


resolution
total_count


NONE
494882


ARREST, BOOKED
174718


ARREST, CITED
34180


PSYCHOPATHIC CASE
11563


LOCATED
10411


{:.table-sql}


3. List the top 10 most frequently-occurring values of the category field

Pretty much the exact same thing as the previous question, with a few details changed. See if you can write this out from scratch:
Expected results


category
total_count


LARCENY/THEFT
187882


OTHER OFFENSES
98907


NON-CRIMINAL
94489


ASSAULT
63874


VANDALISM
38162


VEHICLE THEFT
33896


WARRANTS
33163


BURGLARY
30241


SUSPICIOUS OCC
28062


DRUG/NARCOTIC
27125


4. List the top 10 most frequently occurring category and description combinations

Virtually the same as the prior question, except we're grouping by, and showing, two columns (besides the "total_count" aggregate column).
Expected results


category
description
total_count


LARCENY/THEFT
GRAND THEFT FROM LOCKED AUTO
73239


NON-CRIMINAL
LOST PROPERTY
23236


LARCENY/THEFT
PETTY THEFT OF PROPERTY
22825


OTHER OFFENSES
DRIVERS LICENSE, SUSPENDED OR REVOKED
22497


NON-CRIMINAL
AIDED CASE, MENTAL DISTURBED
22228


VEHICLE THEFT
STOLEN AUTOMOBILE
21383


ASSAULT
BATTERY
19121


WARRANTS
WARRANT ARREST
18235


LARCENY/THEFT
PETTY THEFT FROM LOCKED AUTO
18046


VANDALISM
MALICIOUS MISCHIEF, VANDALISM
16465


5x. (EXAMPLE) List all descriptions of incidents categorized as  "MISSING PERSON", in order of frequency

This is virtually the same as the prior exercise, but I make this an example to show something subtle about the results.
Expected results

This is what I want to see -- note that the "category" column is left out:


description
total_count


FOUND PERSON
10981


MISSING ADULT
7054


MISSING JUVENILE
4555


Hints

If you piggyback from the answer in the previous exercise, you might come up with this:
SELECT 
  category, description,
  COUNT(*) AS total_count
FROM
  incidents
WHERE 
  category = 'MISSING PERSON'
GROUP BY 
  category, description
ORDER BY 
  total_count DESC;
Which results in this:


category
description
total_count


MISSING PERSON
FOUND PERSON
10981


MISSING PERSON
MISSING ADULT
7054


MISSING PERSON
MISSING JUVENILE
4555


The "total_count" column ends up being exactly the same, so the calculations were the same. The only difference is in structure: your result has the "category" column and my expected result does not. Why do I prefer the latter?
Think about what is happening in the WHERE clause:
WHERE 
  category = 'MISSING PERSON'
By definition, all results are going to have the "category" of 'MISSING PERSON', i.e. every row will have 'MISSING PERSON' or else it wouldn't be included in the results. What's the purpose of doing a group count by "category" if we know there aren't different groups?
And since we aren't doing a GROUP BY that refers to the "category" column, best practice is to omit it from the SELECT clause too.
Query

Note how this query omits mention of the "category" column  for purposes of displaying (via SELECT) or grouping (via GROUP BY):
SELECT 
  description,
  COUNT(*) AS total_count
FROM
  incidents
WHERE 
  category = 'MISSING PERSON'
GROUP BY 
  description
ORDER BY 
  total_count DESC;
5.  List all descriptions of incidents categorized as either  "MISSING PERSON" or "TRESPASS", in order of frequency

This is a situation in which we do want to include "category" as it can take on more than one value (two, in this case).
Expected results

Without the  "category" column, we'd run into problems in which two "description" values had different categories and are meant to be treated differently. That isn't the case with the categories of "MISSING PERSON" and "TRESSPASS", but we have no idea the taxonomy used by the SFPD.


category
description
total_count


MISSING PERSON
FOUND PERSON
10981


MISSING PERSON
MISSING ADULT
7054


TRESPASS
TRESPASSING
5430


MISSING PERSON
MISSING JUVENILE
4555


TRESPASS
LODGING WITHOUT PERMISSION
1288


TRESPASS
LOITERING WITHOUT LAWFUL BUSINESS WITH OWNER OR OCCUPANT
30


TRESPASS
TRESPASSING ON RAILROAD TRAINS
7


TRESPASS
DISASTER AREA, ENTERING OR REMAINING IN
3


6x. (EXAMPLE) List total incidents per year

As before, treat the "datetime" column as a string, and use the SUBSTR function to select the first 4 characters and effectively create a new column. Alias that column as "year" and sort by that alias.
Expected results


year
total_count


2012
140847


2013
152806


2014
150128


2015
156530


2016
150736


Query

  SELECT 
    SUBSTR(datetime, 1, 4) AS year,
    COUNT(*) AS total_count
  FROM incidents
  GROUP BY year
  ORDER BY year ASC; 
6. List total incidents by month (ignoring year)

Sometimes it's interesting to count/group by a time cycle, such as by month (as opposed to year and month). This is the same as counting by year, except extracting a different part of the datetime string.
Expected results


month
total_count


01
63726


02
58656


03
63582


04
61351


05
63459


06
60730


07
62828


08
65574


09
63561


10
66932


11
61149


12
59499


Hints

If you can't figure out the arguments for the SUBSTR function to excerpt that middle part of the datetime value, try it out iteratively:
SELECT
  datetime,
  SUBSTR(datetime, 1, 10)
FROM incidents
LIMIT 5;


datetime
SUBSTR(datetime, 1, 10)


2012-01-01 00:01:00
2012-01-01


2012-01-01 00:01:00
2012-01-01


2012-01-01 00:01:00
2012-01-01


2012-01-01 00:01:00
2012-01-01


2012-01-01 00:01:00
2012-01-01


Keep trying until you get what you want:
SELECT
  datetime,
  SUBSTR(datetime, 2, 7)
FROM incidents
LIMIT 5;


datetime
SUBSTR(datetime, 2, 7)


2012-01-01 00:01:00
012-01-


2012-01-01 00:01:00
012-01-


2012-01-01 00:01:00
012-01-


2012-01-01 00:01:00
012-01-


2012-01-01 00:01:00
012-01-


7. Yearly count of incidents categorized as "DRUG/NARCOTIC"

Expected result


year
total_count


2012
6444


2013
6775


2014
5408


2015
4254


2016
4244


8. Yearly count of incidents having anything to do with "marijuana"

Pretend that there are "marijuana"-related incidents that are not just in the category of "DRUG/NARCOTIC". Assume that "having anything to do with 'marijuana'" means that the word 'marijuana' appears somewhere in the record.
Read: Fuzzy matching in SQL, particularly Matching patterns with LIKE
Expected results


year
total_count


2012
1395


2013
1371


2014
1191


2015
892


2016
740


9. Yearly count of "DRUG/NARCOTIC" incidents that do not mention marijuana

Answer is very similar to the previous one, but requires a WHERE condition to filter on the "category" and "description" columns.
Expected results


year
total_count


2012
5049


2013
5404


2014
4217


2015
3362


2016
3504


{:.table-sql}


## 04-having.md

      
    Raw
  

              04-having.md
            
          
    SQL practice with SFPD incidents and the HAVING clause (4 of 4)

Directions

This exercise set has 3 problems, not including 1 example problems, 1x.
Email me at dun@stanford.edu with the subject:
padjo-2017 homework sql-ex-4 your_sunet_id

With a Google Doc (attached or linked-to) in which each query has its own page.
The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:
http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip
Introduction

Related: Stack Overflow: Using group by and having clause
(warning: tortured analogy)
Understanding the HAVING clause doesn't require much more than understanding WHERE. However, the reason why HAVING exists is due to the internals (i.e. something we treat as a black box) of how SQL processes our query.
Imagine a table full of assorted fruit, bananas, oranges, kiwis, just all over the place.
If we wanted just the fruit that was bananas, we'd say something like:
Gather up the fruit that are bananas,
SELECT * 
FROM fruits 
WHERE type = 'banana';
What if we wanted to collect just the yellowish-fruits which as a group numbered at least 10 or more, i.e.
I want just the yellow fruits of which there are more than 10 of each on this table
In SQL, that query would look something like this:
SELECT type, 
  COUNT(*) AS quantity
FROM fruits
WHERE color = 'yellow'
GROUP BY type
HAVING quantity > 10;
In SQL, that quantity > 10 must be in the HAVING clause. Because the filtering conditions in the WHERE clause are done before the group-by-aggregation. The column "quantity" simply does not exist before the GROUP BY is done.
To go back to our table full of fruit analogy: it's possible to gather/discard fruit based on the color of 'yellow' before we start grouping the fruit into groups by type, i.e. 'lemon', 'bananas', etc. However, it is only that we've grouped the fruit together that we can count each type's quantity
1x. (EXAMPLE) List the categories in which the incident count is greater than 50,000

Expected results


category
total_count


LARCENY/THEFT
187882


OTHER OFFENSES
98907


NON-CRIMINAL
94489


ASSAULT
63874


Start off by doing a group count by grouping by the "category" column and doing a reverse sort of the count column (aliased as "total_count"):
SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY category
ORDER BY total_count DESC;
For the sake of brevity, here are the first 10 or so rows:


category
total_count


LARCENY/THEFT
187882


OTHER OFFENSES
98907


NON-CRIMINAL
94489


ASSAULT
63874


VANDALISM
38162


VEHICLE THEFT
33896


WARRANTS
33163


BURGLARY
30241


SUSPICIOUS OCC
28062


DRUG/NARCOTIC
27125


MISSING PERSON
22590


Hints

The point of the HAVING clause is to filter on the aggregated column, i.e. the "total_count" column. The aggregate happens because of the GROUP BY clause, so HAVING always comes after GROUP BY. Otherwise, it's not much different than WHERE
Query

SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY category
HAVING total_count > 50000
ORDER BY total_count DESC;
Further discussion

A common error is to try to use WHERE when only HAVING will do:
SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
WHERE total_count > 50000
GROUP BY category
ORDER BY total_count DESC;
You should get this error:
misuse of aggregate: COUNT()

Likewise, HAVING is easily confused with WHERE. The following query does a silly filter for rows in which "category" is equal to 'BURGLARY', i.e. all the results will be burglaries:
SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
GROUP BY category
HAVING category = 'BURGLARY'
ORDER BY total_count DESC;
However, this doesn't throw an error because HAVING works pretty much the same as WHERE:


category
total_count


BURGLARY
30241


This is equally nonsensical without throwing any errors:
SELECT 
  category,
  COUNT(*) AS total_count
FROM 
  incidents
WHERE category = 'BURGLARY'
GROUP BY category
HAVING category = 'BURGLARY'
ORDER BY total_count DESC;
It still works. But it is nonsensical in a redundant way, like saying:

I want to buy the cat with a hat that is a cat

To reduce confusion, best practice in terms of SQL style is to only do filtering on aggregate columns in the HAVING clause.
1. List in alphabetical order the descriptions that have more than 20,000 incidents

Expected results


description
total_count


AIDED CASE, MENTAL DISTURBED
22228


DRIVERS LICENSE, SUSPENDED OR REVOKED
22497


GRAND THEFT FROM LOCKED AUTO
73239


LOST PROPERTY
23236


PETTY THEFT OF PROPERTY
22825


STOLEN AUTOMOBILE
21383


2. For the year 2015, list the precincts that had more than 20,000 incidents, in order of most incidents

Expected results


pd_district
total_count


SOUTHERN
30095


NORTHERN
20089


3. List the top 5 descriptions that have more than 1,000 incidents for the years 2014 through 2016, sorted in order of shortest to longest description

This one is a bit of a trick question. Not much different from the previous questions, but uses a LENGTH function (to determine the length of a string) as a way to sort the results.
If you can get this, then you have a decent sense of how WHERE and GROUP BY are related but different...
Expected results


description
total_count


BATTERY
11786


AIDED CASE
3185


CONSPIRACY
1410


TRESPASSING
3497


CASE CLOSURE
2272


Hints

In case you need a reminder on boolean expressions:

Conditional filtering with WHERE
Fuzzy matching in SQL
Using the WHERE clause to filter data in SQL
More Boolean Expressions to Filter SQL Queries
Using LIKE, IN, BETWEEN, and wildcards to match multiple values in SQL
incident_number	datetime	category	description	resolution	pd_district
170000198	2016-12-31 23:58:00	DRIVING UNDER THE INFLUENCE	DRIVING WHILE UNDER THE INFLUENCE OF DRUGS	ARREST, BOOKED	MISSION
{:.table-sql}
incident_number	datetime	category	description	pd_district
120001613	2012-01-01 02:30:00	VEHICLE THEFT	STOLEN AUTOMOBILE	INGLESIDE
120002257	2012-01-01 03:00:00	VEHICLE THEFT	STOLEN AUTOMOBILE	SOUTHERN
120135751	2012-01-01 08:00:00	VEHICLE THEFT	STOLEN MOTORCYCLE	INGLESIDE
120003368	2012-01-01 09:00:00	VEHICLE THEFT	STOLEN AUTOMOBILE	INGLESIDE
120498349	2012-01-01 12:00:00	VEHICLE THEFT	STOLEN MOTORCYCLE	INGLESIDE
incident_number	datetime	category	description	resolution	pd_district
120002837	2012-01-01 23:05:00	VANDALISM	MALICIOUS MISCHIEF, VANDALISM OF VEHICLES		INGLESIDE
{:.table-sql}
datetime	SUBSTR(datetime, 1, 4)
2012-01-01 00:01:00	2012
2012-01-01 00:01:00	2012
2012-01-01 00:01:00	2012
2012-01-01 00:01:00	2012
2012-01-01 00:01:00	2012
datetime	SUBSTR(datetime, 1, 4)	SUBSTR(datetime, 1, 7)
2012-01-01 00:01:00	2012	2012-01
2012-01-01 00:01:00	2012	2012-01
2012-01-01 00:01:00	2012	2012-01
2012-01-01 00:01:00	2012	2012-01
2012-01-01 00:01:00	2012	2012-01
{:.table-sql}
incident_number	datetime	category	description	resolution	pd_district
140592131	2014-01-01 00:01:00	SEX OFFENSES, FORCIBLE	SODOMY	ARREST, BOOKED	MISSION
140882346	2014-01-01 00:01:00	FRAUD	FRAUDULENT USE OF AUTOMATED TELLER CARD		NORTHERN
140309936	2014-01-01 00:01:00	OTHER OFFENSES	FALSE PERSONATION TO RECEIVE MONEY OR PROPERTY	UNFOUNDED	RICHMOND
140471381	2014-01-01 00:01:00	FRAUD	CREDIT CARD, THEFT BY USE OF		CENTRAL
140057743	2014-01-01 00:01:00	SUSPICIOUS OCC	SUSPICIOUS OCCURRENCE		NORTHERN
incident_number	datetime	category	description	resolution	pd_district
120005342	2012-01-03 06:42:00	VEHICLE THEFT	ATTEMPTED STOLEN VEHICLE	ARREST, BOOKED	MISSION
120011355	2012-01-04 21:00:00	VEHICLE THEFT	STOLEN AUTOMOBILE	UNFOUNDED	SOUTHERN
120016656	2012-01-05 16:00:00	VEHICLE THEFT	STOLEN AUTOMOBILE	ARREST, BOOKED	RICHMOND
120017604	2012-01-06 10:30:00	VEHICLE THEFT	STOLEN TRUCK	ARREST, BOOKED	INGLESIDE
120017165	2012-01-06 21:45:00	VEHICLE THEFT	STOLEN AUTOMOBILE	UNFOUNDED	INGLESIDE
incident_number	datetime	category	description	resolution	pd_district
141098374	2014-12-30 18:55:00	DRUG/NARCOTIC	POSSESSION OF BASE/ROCK COCAINE FOR SALE	ARREST, BOOKED	TENDERLOIN
141096942	2014-12-30 11:30:00	DRUG/NARCOTIC	POSSESSION OF BASE/ROCK COCAINE FOR SALE	ARREST, BOOKED	SOUTHERN
141094366	2014-12-29 15:10:00	DRUG/NARCOTIC	POSSESSION OF BASE/ROCK COCAINE		TENDERLOIN
141094366	2014-12-29 15:10:00	DRUG/NARCOTIC	SALE OF BASE/ROCK COCAINE		TENDERLOIN
141092928	2014-12-29 03:15:00	DRUG/NARCOTIC	POSSESSION OF COCAINE		INGLESIDE
incident_number	datetime	category	description	resolution	pd_district
141101610	2014-12-31 21:51:00	DRUG/NARCOTIC	POSSESSION OF METH-AMPHETAMINE	ARREST, BOOKED	RICHMOND
141101444	2014-12-31 20:37:00	DRUG/NARCOTIC	POSSESSION OF MARIJUANA	ARREST, BOOKED	MISSION
141099429	2014-12-31 08:08:00	DRUG/NARCOTIC	POSSESSION OF CONTROLLED SUBSTANCE	ARREST, BOOKED	SOUTHERN
141099429	2014-12-31 08:08:00	DRUG/NARCOTIC	POSSESSION OF NARCOTICS PARAPHERNALIA	ARREST, BOOKED	SOUTHERN
141099350	2014-12-31 03:16:00	DRUG/NARCOTIC	POSSESSION OF MARIJUANA		TARAVAL
resolution	total_count
ARREST, BOOKED	174718
ARREST, CITED	34180
CLEARED-CONTACT JUVENILE FOR MORE INFO	362
COMPLAINANT REFUSES TO PROSECUTE	4595
DISTRICT ATTORNEY REFUSES TO PROSECUTE	1055
EXCEPTIONAL CLEARANCE	2188
JUVENILE ADMONISHED	581
JUVENILE BOOKED	4781
JUVENILE CITED	1740
JUVENILE DIVERTED	190
LOCATED	10411
NONE	494882
NOT PROSECUTED	504
PROSECUTED BY OUTSIDE AGENCY	975
PROSECUTED FOR LESSER OFFENSE	5
PSYCHOPATHIC CASE	11563
UNFOUNDED	8317