Skip to content

Instantly share code, notes, and snippets.

@dannguyen
Created October 26, 2016 01:53
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save dannguyen/b7bf6a794c3cb760ed32c3b0e4b29c27 to your computer and use it in GitHub Desktop.
Save dannguyen/b7bf6a794c3cb760ed32c3b0e4b29c27 to your computer and use it in GitHub Desktop.
Counting earthquakes in the contiguous United States with sqlite and group by

Counting earthquakes in the contiguous United States.

Dataset here: http://2016.padjo.org/tutorials/sqlite-data-starterpacks/#more-info-m3-0-earthquakes-in-the-contiguous-u-s-1995-through-2015

Let's start with a count of M3.0 earthquakes in the entire dataset (which spans from 1995 through 2015)

SELECT COUNT(*)
FROM quakes;
COUNT(*)
16941

Let's narrow that query to count only M4.0+ quakes:

SELECT COUNT(*)
FROM quakes
WHERE mag >= 4.0;
COUNT(*)
1945

When the COUNT aggregate function is used on its own, it crunches the dataset into a single scalar value, which is what we want: the total number of earthquakes.

But when we want more than just one count, such as a count by year, then we use GROUP BY

There is no year column in the dataset. But the time column follows standard UTC format, e.g. 2005-01-02T20:58:48.280Z

Which means SQLite will treat it the time column with formality (SQLite has no time/date datatype).

Let's ignore the aggregation for now and get a list of top 10 earthquakes by magnitude, time, and extracted year. The strftime function, confusing as it is, can be used here to extract the year:

SELECT mag, time, 
        strftime('%Y', time) AS year
FROM quakes
ORDER BY mag DESC
LIMIT 10;
mag time year
7.2 2010-04-04 22:40:42.360000 2010
7.1 1999-10-16 09:46:44.460000 1999
7.0 2012-04-12 07:15:48.500000 2012
6.9 2009-08-03 17:59:56.170000 2009
6.8 2001-02-28 18:54:32.830000 2001
6.7 2015-09-13 08:14:08.560000 2015
6.7 2010-10-21 17:53:13.570000 2010
6.6 2013-10-19 17:54:54.700000 2013
6.6 2006-01-04 08:32:32.400000 2006
6.6 1995-08-28 10:46:12.060000 1995

However, if strftime is overkill here (and maddeningly besides). SQLite treats the time column as a string, which means we can use SUBSTR to same effect:

SELECT mag, time, 
        SUBSTR(time, 1, 4) AS year
FROM quakes
ORDER BY mag DESC
LIMIT 10;

Going back to the aggregation problem; to find quakes per year, we needed to get a year value. Now that we have it, we can GROUP BY it:

SELECT 
  SUBSTR(time, 1, 4) AS year,
  COUNT(*) AS qcount
FROM quakes
GROUP BY year
ORDER BY year ASC;

The result:

year qcount
1995 657
1996 572
1997 618
1998 649
1999 1249
2000 559
2001 612
2002 585
2003 593
2004 851
2005 554
2006 529
2007 460
2008 604
2009 537
2010 2053
2011 799
2012 580
2013 724
2014 1460
2015 1696
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment