You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Each exercise section has a set of exercises with a (perhaps too vague) prompt, like:
Total number of incidents categorized as "ASSAULT"
And the expected results from a SQL query:
total_assaults
63874
Your job is to: write the SQL query that answers the prompt and gets ther results.
In this case:
SELECTCOUNT(*) AS total_assaults
FROM
incidents
WHERE
category ='ASSAULT';
For each exercise section, you'll email me a Word/Google doc in which each page has its own query.
Btw, the above example was the actual answer for Set 2, Exercise 2, but please write it out for yourself and make sure it works before just handing it in.
This is an overview lesson that covers all of the syntax needed to find and sort records in a database before getting into the GROUP BY clause.
Think of these exercises as covering database basics before getting into the aggregation functionality of GROUP BY. It's just like wanting to cover the many features of spreadsheets before getting into pivot tables and aggregation.
Directions
This exercise set has 6 problems, not including two example problems (i.e. 1x and 4x).
It is an overwhelming number of concepts and syntax, but mastery isn't expected, just familiarity for now.
Exercises
1x. (EXAMPLE) Show just the datetime column from the very first incident in the database
"Very first incident" in the database basically means, whatever the first row is in the data table without an additional sorting of our own. Use a LIMIT clause to make sure the results only include a single row.
The key to this exercise is to realize that the "datetime" field is a text field, as far as SQLite is concerned. Which means we can use text functions to extract dates and do comparisons.
Try:
SELECT datetime
FROM incidents
LIMIT5;
Then:
SELECT datetime,
SUBSTR(datetime, 1, 4)
FROM incidents
LIMIT5;
What might be hard to figure out is how to set the condition to limit records in which the "datetime" value (which is not just a date but a time) is "equal" to the first day of January 2012.
This won't work:
SELECT
datetime
FROM incidents
WHERE datetime ='2012-01-01'LIMIT1;
Though using comparisons might work:
SELECT
datetime
FROM incidents
WHERE datetime >'2012-01-01'AND datetime <'2012-01-02'LIMIT1;
There's all kind of variety:
SELECT
datetime
FROM incidents
WHERE datetime
BETWEEN '2012-01-01'AND'2012-01-02'LIMIT1;
Query
SELECT
incident_number,
datetime,
category,
description,
resolution,
pd_district
FROM
incidents
WHERE
SUBSTR(datetime, 1, 10) ='2012-01-01'ORDER BY
datetime DESCLIMIT1;
4. Find the earliest 5 incidents of the year 2014
Expected results
incident_number
datetime
category
description
resolution
pd_district
140592131
2014-01-01 00:01:00
SEX OFFENSES, FORCIBLE
SODOMY
ARREST, BOOKED
MISSION
140882346
2014-01-01 00:01:00
FRAUD
FRAUDULENT USE OF AUTOMATED TELLER CARD
NORTHERN
140309936
2014-01-01 00:01:00
OTHER OFFENSES
FALSE PERSONATION TO RECEIVE MONEY OR PROPERTY
UNFOUNDED
RICHMOND
140471381
2014-01-01 00:01:00
FRAUD
CREDIT CARD, THEFT BY USE OF
CENTRAL
140057743
2014-01-01 00:01:00
SUSPICIOUS OCC
SUSPICIOUS OCCURRENCE
NORTHERN
5. Find the earliest 5 chronological vehicle thefts that had a resolution other than "NONE"
Expected results
incident_number
datetime
category
description
resolution
pd_district
120005342
2012-01-03 06:42:00
VEHICLE THEFT
ATTEMPTED STOLEN VEHICLE
ARREST, BOOKED
MISSION
120011355
2012-01-04 21:00:00
VEHICLE THEFT
STOLEN AUTOMOBILE
UNFOUNDED
SOUTHERN
120016656
2012-01-05 16:00:00
VEHICLE THEFT
STOLEN AUTOMOBILE
ARREST, BOOKED
RICHMOND
120017604
2012-01-06 10:30:00
VEHICLE THEFT
STOLEN TRUCK
ARREST, BOOKED
INGLESIDE
120017165
2012-01-06 21:45:00
VEHICLE THEFT
STOLEN AUTOMOBILE
UNFOUNDED
INGLESIDE
6. Find the last 5 incidents of the year 2014 that had something to do with cocaine
The SFPD data doesn't lend itself to much quantification, as the latitude and longitude columns are the only numerical columns, thus sum, average, max/min aren't very interesting aggregate methods.
So for this exercise, we just care about the COUNT function. When COUNT, or any aggregate function, is used without a GROUP BY clause, the result is a single value.
Pretend you had a spreadsheet and you used the =COUNTA() or =SUM() function across every row, i.e. =COUNTA(A1:A9999 and =SUM(A1:A9999). You would expect the function to deliver a single value.
The spreadsheet equivalent of "GROUP BY" is to do things using pivot tables. But for these SQL exercises, it's a singular count (i.e. not a pivot table, i.e. not using GROUP BY).
Besides the COUNT() function in the SELECT clause, all the other SQL syntax remains the same.
Related reading
Check out this past lesson, Aggregate functions in SQL. Specifically, focus on the The COUNT section and especially, the Counting without grouping section. (Ignore the GROUP BY stuff for now)
Exercises
1x. (EXAMPLE) Just do a count
The syntax for calculating the count of rows in a result set is simply:
SELECTCOUNT(*);
Believe it or not, even without reference to the database table (i.e. incidents), this has a valid result:
COUNT(*)
1
This can be interpreted as something like, The number of rows in the result for our useless queries is: 1.
1. Total number of incidents
Using SELECT COUNT(*) with a FROM clause will count the total number of rows in the table.
Expected result
total_count
751047
Hints
The expected answer expects a column name of total_count, so your query should include the proper alias clause.
2. Total number of incidents categorized as "ASSAULT"
Same as previous question, except there is a WHERE clause and a different alias.
Expected result
total_assaults
63874
Query
(I'm just giving you this answer)
SELECTCOUNT(*) AS total_assaults
FROM
incidents
WHERE
category ='ASSAULT';
3. Total number of incidents having something to do with marijuana
Assume "having something to do with marijuana" means that the literal text 'marijuana' shows up in the "description" field, i.e. you will want to use the LIKE and wildcards.
Unlike the previous bad example, there are only 2 separate conditions that must both be met. The latter condition, dealing with the resolution, can have multiple choices. But in the end, all results are in the 'TENDERLOIN' district.
Query
SELECTCOUNT(*) AS total_count
FROM
incidents
WHERE
pd_district ='TENDERLOIN'AND (resolution ='UNFOUNDED'OR resolution ='NOT PROSECUTED');
5. Total number of incidents taking place in 2016 in either the "TENDERLOIN" or "MISSION" districts
Unlike the previous set of exercises, which practiced using the COUNT aggregation function to provide a single result value (i.e. the row count of the result set), these exercises use aggregation with the GROUP BY clause.
Instead of counting the number of results, we can get a count of results by group. Just like we can with pivot tables.
1. Group by the resolution field, list in alphabetical order
Expected result
resolution
ARREST, BOOKED
ARREST, CITED
CLEARED-CONTACT JUVENILE FOR MORE INFO
COMPLAINANT REFUSES TO PROSECUTE
DISTRICT ATTORNEY REFUSES TO PROSECUTE
EXCEPTIONAL CLEARANCE
JUVENILE ADMONISHED
JUVENILE BOOKED
JUVENILE CITED
JUVENILE DIVERTED
LOCATED
NONE
NOT PROSECUTED
PROSECUTED BY OUTSIDE AGENCY
PROSECUTED FOR LESSER OFFENSE
PSYCHOPATHIC CASE
UNFOUNDED
Hints
This is pretty straightforward; requires a "GROUP BY" and "ORDER BY" clauses
2x. (EXAMPLE) Do a group count of the resolution field, list resolutions in alphabetical order
Expected results
resolution
total_count
ARREST, BOOKED
174718
ARREST, CITED
34180
CLEARED-CONTACT JUVENILE FOR MORE INFO
362
COMPLAINANT REFUSES TO PROSECUTE
4595
DISTRICT ATTORNEY REFUSES TO PROSECUTE
1055
EXCEPTIONAL CLEARANCE
2188
JUVENILE ADMONISHED
581
JUVENILE BOOKED
4781
JUVENILE CITED
1740
JUVENILE DIVERTED
190
LOCATED
10411
NONE
494882
NOT PROSECUTED
504
PROSECUTED BY OUTSIDE AGENCY
975
PROSECUTED FOR LESSER OFFENSE
5
PSYCHOPATHIC CASE
11563
UNFOUNDED
8317
Query
SELECT
resolution,
COUNT(*) AS total_count
FROM
incidents
GROUP BY
resolution
ORDER BY
resolution ASC;
2. List the 5 most frequently-occurring values of the resolution field
Same as previous question, just a different sort order, and a LIMIT clause.
Expected results
resolution
total_count
NONE
494882
ARREST, BOOKED
174718
ARREST, CITED
34180
PSYCHOPATHIC CASE
11563
LOCATED
10411
{:.table-sql}
3. List the top 10 most frequently-occurring values of the category field
Pretty much the exact same thing as the previous question, with a few details changed. See if you can write this out from scratch:
Expected results
category
total_count
LARCENY/THEFT
187882
OTHER OFFENSES
98907
NON-CRIMINAL
94489
ASSAULT
63874
VANDALISM
38162
VEHICLE THEFT
33896
WARRANTS
33163
BURGLARY
30241
SUSPICIOUS OCC
28062
DRUG/NARCOTIC
27125
4. List the top 10 most frequently occurring category and description combinations
Virtually the same as the prior question, except we're grouping by, and showing, two columns (besides the "total_count" aggregate column).
Expected results
category
description
total_count
LARCENY/THEFT
GRAND THEFT FROM LOCKED AUTO
73239
NON-CRIMINAL
LOST PROPERTY
23236
LARCENY/THEFT
PETTY THEFT OF PROPERTY
22825
OTHER OFFENSES
DRIVERS LICENSE, SUSPENDED OR REVOKED
22497
NON-CRIMINAL
AIDED CASE, MENTAL DISTURBED
22228
VEHICLE THEFT
STOLEN AUTOMOBILE
21383
ASSAULT
BATTERY
19121
WARRANTS
WARRANT ARREST
18235
LARCENY/THEFT
PETTY THEFT FROM LOCKED AUTO
18046
VANDALISM
MALICIOUS MISCHIEF, VANDALISM
16465
5x. (EXAMPLE) List all descriptions of incidents categorized as "MISSING PERSON", in order of frequency
This is virtually the same as the prior exercise, but I make this an example to show something subtle about the results.
Expected results
This is what I want to see -- note that the "category" column is left out:
description
total_count
FOUND PERSON
10981
MISSING ADULT
7054
MISSING JUVENILE
4555
Hints
If you piggyback from the answer in the previous exercise, you might come up with this:
SELECT
category, description,
COUNT(*) AS total_count
FROM
incidents
WHERE
category ='MISSING PERSON'GROUP BY
category, description
ORDER BY
total_count DESC;
Which results in this:
category
description
total_count
MISSING PERSON
FOUND PERSON
10981
MISSING PERSON
MISSING ADULT
7054
MISSING PERSON
MISSING JUVENILE
4555
The "total_count" column ends up being exactly the same, so the calculations were the same. The only difference is in structure: your result has the "category" column and my expected result does not. Why do I prefer the latter?
Think about what is happening in the WHERE clause:
WHERE
category ='MISSING PERSON'
By definition, all results are going to have the "category" of 'MISSING PERSON', i.e. every row will have 'MISSING PERSON' or else it wouldn't be included in the results. What's the purpose of doing a group count by "category" if we know there aren't different groups?
And since we aren't doing a GROUP BY that refers to the "category" column, best practice is to omit it from the SELECT clause too.
Query
Note how this query omits mention of the "category" column for purposes of displaying (via SELECT) or grouping (via GROUP BY):
SELECT
description,
COUNT(*) AS total_count
FROM
incidents
WHERE
category ='MISSING PERSON'GROUP BY
description
ORDER BY
total_count DESC;
5. List all descriptions of incidents categorized as either "MISSING PERSON" or "TRESPASS", in order of frequency
This is a situation in which we do want to include "category" as it can take on more than one value (two, in this case).
Expected results
Without the "category" column, we'd run into problems in which two "description" values had different categories and are meant to be treated differently. That isn't the case with the categories of "MISSING PERSON" and "TRESSPASS", but we have no idea the taxonomy used by the SFPD.
category
description
total_count
MISSING PERSON
FOUND PERSON
10981
MISSING PERSON
MISSING ADULT
7054
TRESPASS
TRESPASSING
5430
MISSING PERSON
MISSING JUVENILE
4555
TRESPASS
LODGING WITHOUT PERMISSION
1288
TRESPASS
LOITERING WITHOUT LAWFUL BUSINESS WITH OWNER OR OCCUPANT
30
TRESPASS
TRESPASSING ON RAILROAD TRAINS
7
TRESPASS
DISASTER AREA, ENTERING OR REMAINING IN
3
6x. (EXAMPLE) List total incidents per year
As before, treat the "datetime" column as a string, and use the SUBSTR function to select the first 4 characters and effectively create a new column. Alias that column as "year" and sort by that alias.
Expected results
year
total_count
2012
140847
2013
152806
2014
150128
2015
156530
2016
150736
Query
SELECT
SUBSTR(datetime, 1, 4) AS year,
COUNT(*) AS total_count
FROM incidents
GROUP BY year
ORDER BY year ASC;
6. List total incidents by month (ignoring year)
Sometimes it's interesting to count/group by a time cycle, such as by month (as opposed to year and month). This is the same as counting by year, except extracting a different part of the datetime string.
Expected results
month
total_count
01
63726
02
58656
03
63582
04
61351
05
63459
06
60730
07
62828
08
65574
09
63561
10
66932
11
61149
12
59499
Hints
If you can't figure out the arguments for the SUBSTR function to excerpt that middle part of the datetime value, try it out iteratively:
SELECT
datetime,
SUBSTR(datetime, 1, 10)
FROM incidents
LIMIT5;
datetime
SUBSTR(datetime, 1, 10)
2012-01-01 00:01:00
2012-01-01
2012-01-01 00:01:00
2012-01-01
2012-01-01 00:01:00
2012-01-01
2012-01-01 00:01:00
2012-01-01
2012-01-01 00:01:00
2012-01-01
Keep trying until you get what you want:
SELECT
datetime,
SUBSTR(datetime, 2, 7)
FROM incidents
LIMIT5;
datetime
SUBSTR(datetime, 2, 7)
2012-01-01 00:01:00
012-01-
2012-01-01 00:01:00
012-01-
2012-01-01 00:01:00
012-01-
2012-01-01 00:01:00
012-01-
2012-01-01 00:01:00
012-01-
7. Yearly count of incidents categorized as "DRUG/NARCOTIC"
Expected result
year
total_count
2012
6444
2013
6775
2014
5408
2015
4254
2016
4244
8. Yearly count of incidents having anything to do with "marijuana"
Pretend that there are "marijuana"-related incidents that are not just in the category of "DRUG/NARCOTIC". Assume that "having anything to do with 'marijuana'" means that the word 'marijuana' appears somewhere in the record.
Understanding the HAVING clause doesn't require much more than understanding WHERE. However, the reason why HAVING exists is due to the internals (i.e. something we treat as a black box) of how SQL processes our query.
Imagine a table full of assorted fruit, bananas, oranges, kiwis, just all over the place.
If we wanted just the fruit that was bananas, we'd say something like:
Gather up the fruit that are bananas,
SELECT*FROM fruits
WHERE type ='banana';
What if we wanted to collect just the yellowish-fruits which as a group numbered at least 10 or more, i.e.
I want just the yellow fruits of which there are more than 10 of each on this table
In SQL, that query would look something like this:
SELECT type,
COUNT(*) AS quantity
FROM fruits
WHERE color ='yellow'GROUP BY type
HAVING quantity >10;
In SQL, that quantity > 10must be in the HAVING clause. Because the filtering conditions in the WHERE clause are done before the group-by-aggregation. The column "quantity" simply does not exist before the GROUP BY is done.
To go back to our table full of fruit analogy: it's possible to gather/discard fruit based on the color of 'yellow' before we start grouping the fruit into groups by type, i.e. 'lemon', 'bananas', etc. However, it is only that we've grouped the fruit together that we can count each type's quantity
1x. (EXAMPLE) List the categories in which the incident count is greater than 50,000
Expected results
category
total_count
LARCENY/THEFT
187882
OTHER OFFENSES
98907
NON-CRIMINAL
94489
ASSAULT
63874
Start off by doing a group count by grouping by the "category" column and doing a reverse sort of the count column (aliased as "total_count"):
SELECT
category,
COUNT(*) AS total_count
FROM
incidents
GROUP BY category
ORDER BY total_count DESC;
For the sake of brevity, here are the first 10 or so rows:
category
total_count
LARCENY/THEFT
187882
OTHER OFFENSES
98907
NON-CRIMINAL
94489
ASSAULT
63874
VANDALISM
38162
VEHICLE THEFT
33896
WARRANTS
33163
BURGLARY
30241
SUSPICIOUS OCC
28062
DRUG/NARCOTIC
27125
MISSING PERSON
22590
Hints
The point of the HAVING clause is to filter on the aggregated column, i.e. the "total_count" column. The aggregate happens because of the GROUP BY clause, so HAVING always comes after GROUP BY. Otherwise, it's not much different than WHERE
Query
SELECT
category,
COUNT(*) AS total_count
FROM
incidents
GROUP BY category
HAVING total_count >50000ORDER BY total_count DESC;
Further discussion
A common error is to try to use WHERE when only HAVING will do:
SELECT
category,
COUNT(*) AS total_count
FROM
incidents
WHERE total_count >50000GROUP BY category
ORDER BY total_count DESC;
You should get this error:
misuse of aggregate: COUNT()
Likewise, HAVING is easily confused with WHERE. The following query does a silly filter for rows in which "category" is equal to 'BURGLARY', i.e. all the results will be burglaries:
SELECT
category,
COUNT(*) AS total_count
FROM
incidents
GROUP BY category
HAVING category ='BURGLARY'ORDER BY total_count DESC;
However, this doesn't throw an error because HAVING works pretty much the same as WHERE:
category
total_count
BURGLARY
30241
This is equally nonsensical without throwing any errors:
SELECT
category,
COUNT(*) AS total_count
FROM
incidents
WHERE category ='BURGLARY'GROUP BY category
HAVING category ='BURGLARY'ORDER BY total_count DESC;
It still works. But it is nonsensical in a redundant way, like saying:
I want to buy the cat with a hat that is a cat
To reduce confusion, best practice in terms of SQL style is to only do filtering on aggregate columns in the HAVING clause.
1. List in alphabetical order the descriptions that have more than 20,000 incidents
Expected results
description
total_count
AIDED CASE, MENTAL DISTURBED
22228
DRIVERS LICENSE, SUSPENDED OR REVOKED
22497
GRAND THEFT FROM LOCKED AUTO
73239
LOST PROPERTY
23236
PETTY THEFT OF PROPERTY
22825
STOLEN AUTOMOBILE
21383
2. For the year 2015, list the precincts that had more than 20,000 incidents, in order of most incidents
Expected results
pd_district
total_count
SOUTHERN
30095
NORTHERN
20089
3. List the top 5 descriptions that have more than 1,000 incidents for the years 2014 through 2016, sorted in order of shortest to longest description
This one is a bit of a trick question. Not much different from the previous questions, but uses a LENGTH function (to determine the length of a string) as a way to sort the results.
If you can get this, then you have a decent sense of how WHERE and GROUP BY are related but different...
Expected results
description
total_count
BATTERY
11786
AIDED CASE
3185
CONSPIRACY
1410
TRESPASSING
3497
CASE CLOSURE
2272
Hints
In case you need a reminder on boolean expressions: