Instantly share code, notes, and snippets.

Embed
What would you like to do?

SQL practice with SFPD incidents and simple aggregation (2 of 4)

These are the answers to:

https://gist.github.com/dannguyen/7e4fcfc6b91dd7749a23560912ee4e1e#file-02-counting-md

Directions

This exercise set has 5 problems, not including two example problems (i.e. 1x and 5x).

Email me at dun@stanford.edu with the subject:

padjo-2017 homework **sql-ex-2** **your_sunet_id**

With a Google Doc (attached or linked-to) in which each query has its own page.

The file to use, as before, is this (zipped) SQLite file of SFPD incidents from 2012 through 2016:

http://stash.padjo.org/data/sfpd-incidents-2012-2016.sqlite.zip

Introduction

The SFPD data doesn't lend itself to much quantification, as the latitude and longitude columns are the only numerical columns, thus sum, average, max/min aren't very interesting aggregate methods.

So for this exercise, we just care about the COUNT function. When COUNT, or any aggregate function, is used without a GROUP BY clause, the result is a single value.

Pretend you had a spreadsheet and you used the =COUNTA() or =SUM() function across every row, i.e. =COUNTA(A1:A9999 and =SUM(A1:A9999). You would expect the function to deliver a single value.

The spreadsheet equivalent of "GROUP BY" is to do things using pivot tables. But for these SQL exercises, it's a singular count (i.e. not a pivot table, i.e. not using GROUP BY).

Besides the COUNT() function in the SELECT clause, all the other SQL syntax remains the same.

Related reading

Check out this past lesson, Aggregate functions in SQL. Specifically, focus on the The COUNT section and especially, the Counting without grouping section. (Ignore the GROUP BY stuff for now)

Exercises

1x. (EXAMPLE) Just do a count

The syntax for calculating the count of rows in a result set is simply:

SELECT COUNT(*); 

Believe it or not, even without reference to the database table (i.e. incidents), this has a valid result:

COUNT(*)
1

This can be interpreted as something like, The number of rows in the result for our useless queries is: 1.

1. Total number of incidents

Using SELECT COUNT(*) with a FROM clause will count the total number of rows in the table.

Expected result

total_count
751047

Hints

The expected answer expects a column name of total_count, so your query should include the proper alias clause.

2. Total number of incidents categorized as "ASSAULT"

Same as previous question, except there is a WHERE clause and a different alias.

Expected result

total_assaults
63874

Query

(I'm just giving you this answer)

SELECT 
  COUNT(*) AS total_assaults
FROM 
  incidents
WHERE 
  category = 'ASSAULT';

3. Total number of incidents having something to do with marijuana

Assume "having something to do with marijuana" means that the literal text 'marijuana' shows up in the "description" field, i.e. you will want to use the LIKE and wildcards.

Read: Fuzzy matching in SQL, particularly Matching patterns with LIKE

Expected result

total_count
5589

4. Total number of incidents taking place in 2014

Expected result

total_count
150128

5x. (EXAMPLE) Total number of incidents that took place in the "TENDERLOIN" district that were either "UNFOUNDED" or "NOT PROSECUTED"

Nothing different from the previous exercises, just a slightly more complicated logic in the WHERE clause.

Expected result

total_count
504

Hints

It is very easy to mix up the logic here. For example, this is NOT the logic that we want for finding both kinds of "resolution":

  resolution = 'UNFOUNDED'
  AND resolution = 'NOT PROSECUTED'

That is mutually exclusive (do the logic in your head). We want to use OR:

  resolution = 'UNFOUNDED'
  OR resolution = 'NOT PROSECUTED'

Now look at the "pd_district" = 'TENDERLOIN' logic separately:

pd_district = 'TENDERLOIN'

How do we combine the two conditional clauses?

Without parentheses

(the exerpt below is obviously just an excerpt, not valid syntax)

  
  pd_district = 'TENDERLOIN'
  AND resolution = 'UNFOUNDED'
  OR resolution = 'NOT PROSECUTED'

The above logic returns results in which:

  1. The district is "TENDERLOIN" and the resolution is "UNFOUNDED"
  2. The resolution is "NOT PROSECUTED"

The results of #2 include rows from districts that are not the TENDERLOIND

Using parentheses to group logic

Using parentheses, as we do in arithmetic, we can enforce the order of the logic:

  
  pd_district = 'TENDERLOIN'
  AND
  (resolution = 'UNFOUNDED'
   OR resolution = 'NOT PROSECUTED')

Unlike the previous bad example, there are only 2 separate conditions that must both be met. The latter condition, dealing with the resolution, can have multiple choices. But in the end, all results are in the 'TENDERLOIN' district.

Query

SELECT 
  COUNT(*) AS total_count
FROM 
  incidents
WHERE 
    pd_district = 'TENDERLOIN'
    AND (resolution = 'UNFOUNDED' 
     OR resolution = 'NOT PROSECUTED');

5. Total number of incidents taking place in 2016 in either the "TENDERLOIN" or "MISSION" districts

Expected results

total_count
29488

Query

SELECT 
  COUNT(*) AS total_count
FROM 
  incidents
WHERE
    SUBSTR(datetime, 1, 4) = '2016' 
    AND
    pd_district IN ('TENDERLOIN', 'MISSION');
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment