These are the answers to:
https://gist.github.com/dannguyen/7e4fcfc6b91dd7749a23560912ee4e1e#file-02-counting-md
This exercise set has 5 problems, not including two example problems (i.e. 1x and 5x).
Email me at dun@stanford.edu with the subject:
padjo-2017 homework **sql-ex-2** **your_sunet_id**
With a Google Doc (attached or linked-to) in which each query has its own page.
The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:
http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip
The SFPD data doesn't lend itself to much quantification, as the latitude and longitude columns are the only numerical columns, thus sum, average, max/min aren't very interesting aggregate methods.
So for this exercise, we just care about the COUNT function. When COUNT, or any aggregate function, is used without a GROUP BY clause, the result is a single value.
Pretend you had a spreadsheet and you used the =COUNTA()
or =SUM()
function across every row, i.e. =COUNTA(A1:A9999
and =SUM(A1:A9999)
. You would expect the function to deliver a single value.
The spreadsheet equivalent of "GROUP BY" is to do things using pivot tables. But for these SQL exercises, it's a singular count (i.e. not a pivot table, i.e. not using GROUP BY).
Besides the COUNT() function in the SELECT clause, all the other SQL syntax remains the same.
Check out this past lesson, Aggregate functions in SQL. Specifically, focus on the The COUNT section and especially, the Counting without grouping section. (Ignore the GROUP BY stuff for now)
The syntax for calculating the count of rows in a result set is simply:
SELECT COUNT(*);
Believe it or not, even without reference to the database table (i.e. incidents
), this has a valid result:
COUNT(*) |
---|
1 |
This can be interpreted as something like, The number of rows in the result for our useless queries is: 1.
Using SELECT COUNT(*)
with a FROM
clause will count the total number of rows in the table.
total_count |
---|
751047 |
The expected answer expects a column name of total_count, so your query should include the proper alias clause.
Same as previous question, except there is a WHERE clause and a different alias.
total_assaults |
---|
63874 |
(I'm just giving you this answer)
SELECT
COUNT(*) AS total_assaults
FROM
incidents
WHERE
category = 'ASSAULT';
Assume "having something to do with marijuana" means that the literal text 'marijuana' shows up in the "description" field, i.e. you will want to use the LIKE and wildcards.
Read: Fuzzy matching in SQL, particularly Matching patterns with LIKE
total_count |
---|
5589 |
total_count |
---|
150128 |
5x. (EXAMPLE) Total number of incidents that took place in the "TENDERLOIN" district that were either "UNFOUNDED" or "NOT PROSECUTED"
Nothing different from the previous exercises, just a slightly more complicated logic in the WHERE clause.
total_count |
---|
504 |
It is very easy to mix up the logic here. For example, this is NOT the logic that we want for finding both kinds of "resolution":
resolution = 'UNFOUNDED'
AND resolution = 'NOT PROSECUTED'
That is mutually exclusive (do the logic in your head). We want to use OR:
resolution = 'UNFOUNDED'
OR resolution = 'NOT PROSECUTED'
Now look at the "pd_district" = 'TENDERLOIN'
logic separately:
pd_district = 'TENDERLOIN'
How do we combine the two conditional clauses?
(the exerpt below is obviously just an excerpt, not valid syntax)
pd_district = 'TENDERLOIN'
AND resolution = 'UNFOUNDED'
OR resolution = 'NOT PROSECUTED'
The above logic returns results in which:
- The district is "TENDERLOIN" and the resolution is "UNFOUNDED"
- The resolution is "NOT PROSECUTED"
The results of #2 include rows from districts that are not the TENDERLOIND
Using parentheses, as we do in arithmetic, we can enforce the order of the logic:
pd_district = 'TENDERLOIN'
AND
(resolution = 'UNFOUNDED'
OR resolution = 'NOT PROSECUTED')
Unlike the previous bad example, there are only 2 separate conditions that must both be met. The latter condition, dealing with the resolution, can have multiple choices. But in the end, all results are in the 'TENDERLOIN' district.
SELECT
COUNT(*) AS total_count
FROM
incidents
WHERE
pd_district = 'TENDERLOIN'
AND (resolution = 'UNFOUNDED'
OR resolution = 'NOT PROSECUTED');
total_count |
---|
29488 |
SELECT
COUNT(*) AS total_count
FROM
incidents
WHERE
SUBSTR(datetime, 1, 4) = '2016'
AND
pd_district IN ('TENDERLOIN', 'MISSION');