andy-esch/readme.md

## readme.md

      
    Raw
  

              readme.md
            
          
    Introduction to SQL and PostGIS in CartoDB

Andy Eschbacher, Map Scientist, CartoDB
Find this document here: http://bit.ly/cdb-berkeley-sql

We will use a PiratePad to share challenge solutions: http://piratepad.net/cdb-berkeley

Roughly the Structure of CartoDB


What is SQL?

Structured Query Language --> "ask questions of data"
Schema

Each dataset in CartoDB has a formal structure of columns with a specific data type and name, and rows that are uniquely identified by a column number cartodb_id.
SQL

SQL is deceptively simple: unlike other programming languages, SQL's complexity ramps up quickly. But the common part of the language is just a handful of keywords: SELECT, FROM, WHERE. You can do maybe 75% of what you'll ever need to do with these keywords.
A typical SQL statement asks for column names from a table given a constraint on the values in the columns.
For instance, we could ask: Give me all the names, heights, and favorite cookies of people in this room whose shoe size is between 10 and 11 and whose father's name is Peter.
The query would look like this:
SELECT
    name,
    height, 
    favorite_cookie
FROM
    everyone_in_this_room_table
WHERE
    shoe_size > 10 
   AND
    shoe_size < 11 
   AND 
    father_name = 'Peter'
This would give back the following result (probably):
|  name  |  height  |  favorite_cookie  |
-----------------------------------------
|  Andy  |   1.78   |      Pecan        |

Which SQL does CartoDB use?

CartoDB uses PostgreSQL because it allows for extensions. One extension specifically allowed for CartoDB to exist: PostGIS. This allows one to do geospatial analysis in a database framework. All queries that don't require admin-level access can be run in your CartoDB database.
PostGIS allows you to do queries like this:
SELECT superhero.name
FROM city, superhero
WHERE ST_Contains(city.geom, superhero.geom)
AND city.name = 'Gotham';
The key piece here is the ST_Contains() portion: it gives back true or false depending on whether the superhero is inside of the city. This can be handy when you have two datasets and want to count how many points from one dataset occur in a polygon from another dataset.
Getting help with SQL in CartoDB

If it's not spatial search for: 'postgresql ...' where the ... is what you need to know. For instance, 'postgresql string operators'.
If the data is spatial, search for: 'postgis ...'. For instance, 'postgis area'.
There is also GIS StackExchange/CartoDB has ~800 questions and answers. GIS SE/PostGIS has many more.
The documentation for these two languages are here:

PostGIS: http://postgis.net/docs/manual-2.1/
PostgreSQL: http://www.postgresql.org/docs/9.3/static/index.html

Our datasets

We'll be working from a dataset that Stace Maples of Stanford liberated from one of the clown unions. It gives us the clown names and zip codes of clowns around the United States.
Retrieve it by copying the following link and importing it into your account:
http://eschbacher.cartodb.com/api/v2/sql?q=SELECT%20*%20FROM%20clowns_of_america_international_member_dataset_anon&format=shp&filename=clowns_sql
Notice that this is a SQL API call: we're specifying the custom SQL with the q parameter, the download format with the format param, and the filename with filename param. We could do any arbitrary SQL query here, even against multiple tables at once.
Exploring SELECT and FROM

We can do basic queries to narrow down, or add to a temporary view of our table:
SELECT
  clown_name,
  city
FROM
  clowns_sql
This narrows down the columns displayed in our table, and does not include the geographical data for mapping.
If we want to map it and display only some of the data, we can do a query like this:
SELECT
  clown_name,
  city,
  the_geom_webmercator
FROM
  clowns_sql
WHERE to go next

We can discover a lot of the comparison operators by using the filters. The comparison operators evalute to true or false statements. We could chain multiple logically statements together with the AND and OR operators. For instance, a filter like this:
SELECT *
FROM cookie_preferences
WHERE (age > 13 OR age < 7) AND (favorite_cookie = 'Chocolate Chip'`)
against a table like this:
|  name  |  age  |  favorite_cookie  |
--------------------------------------
| Andy   |  33   |   Pecan           |
| Linden |  13   |   Chocolate Chip  |
| Xiu    |  12   |   Chocolate Chip  |
| Otto   |  49   |   Peanut Butter   |
| Lisa   |  6    |   Chocolate Chip  |

will produce the following:
|  name  |  age  |  favorite_cookie  |
--------------------------------------
| Lisa   |  6    |   Chocolate Chip  |

Notice that the filter conditions are working on numbers and strings. More information here on these comparison operators. Read more about the logical operators.
You can also have conditions on sets:
SELECT * 
FROM awesome_table
WHERE name IN ('Stace','Kim','David','Bill')
You can also use NOT IN for negation.
And test whether values are null or not:
SELECT *
FROM cat_memes
WHERE (cat_name IS null) And (youtube_url IS NOT null)
Two gems that don't show up in the filters are: LIKE and ILIKE. They match strings that are similar to another string, like this:
'abc' LIKE 'abc'    true
'abc' LIKE 'a%'     true
'Abc' ILIKE 'a%'    true
'abc' LIKE '_b_'    true
'abc' LIKE 'c'      false

ILIKE is especially good for messy data where capitalization is not standardized. You can read much more about pattern matching in PostgreSQL.
Challenge(s) #1


Find all clowns who have more than one name.
Find all clowns who have more than one word in their name (that is, there is a space), and are in a zip code +/- 100 from Berkeley's (94720).
Find all clowns that have an empty country field OR where the_geom is null-valued.
Find all clowns who are not in the United States and whose country is not an empty string.

Before moving on, let's reset our map to the default SQL (SELECT * FROM clown_table), which you can get by clicking 'clear view' at the bottom of the SQL tray if you have a custom SQL statement applied.
Let's style our map as follows:
#clowns{
  marker-fill-opacity: 0.9;
  marker-line-color: #FFF;
  marker-line-width: 0;
  marker-line-opacity: 1;
  marker-placement: point;
  marker-type: ellipse;
  marker-width: 10;
  marker-fill: #0F3B82;
  marker-allow-overlap: true;
  marker-comp-op: plus;
}
The composite operation plus adds the individual rgb channels of the markers to show something like an intensity map of the markers at locations.
Let's also configure our infowindows or hovers to show all the information associated with a click or hover event.
LIMIT the data

You can choose to show only a subset of your data by using the LIMIT keyword as follows:
SELECT
  clown_names,
  zipcode,
  the_geom_webmercator
FROM
  clowns_sql
ORDER BY zipcode DESC
LIMIT 25
Notice that ORDER BY should be included because SQL data tables have no natural order when executing queries unless ORDER BY is specified.
Aggregate functions

We have a problem with our data visualization. There are tons of clowns in the same zipcode. Since the zipcode centers are the rough locations of our clowns, they are stacked on top of each other, so our click events can't find all of the clowns.

A good strategy here is to group all of the clowns who share the same physical location. Luckily, we have a SQL keyword (or two) for that: GROUP BY.
GROUP BY partitions your data into discrete groups, allowing you to perform functions on a group-by-group basis. For instance, if we want to find the average age within a zipcode (if we were so lucky as to have that data here).
Looking at the PostgreSQL aggregate function page, we have several options for aggregating the date in our table.
Our goal is to display all of the names of clowns at a single location or zip code.
We can recast our datatable by aggregating the clowns that are at the same location and city. Within those groups, let's:

Build up a list of clown names (using string_agg)
Build up a list of zip codes (using string_agg)
Count the number of clowns residing there (using count(1))
Calculate the square root of the count for visualizing the number of clowns
Grab one of the cartodb_ids so we can enable click events

SELECT 
  the_geom_webmercator,
  city,
  string_agg(clown_name,', ') clown_names, 
  string_agg(zipcode::text,', ') zips,
  count(1) cnt,
  sqrt(count(1)) sqrtcnt,
  max(cartodb_id) cartodb_id
FROM 
  clowns_sql
GROUP BY
  the_geom_webmercator, city
ORDER BY
  cnt DESC
Click on a symbol and select all of the columns to be displayed in it, like this:

Next, let's configure the style to more effectively visualize the clowns in an area. We will create markers sized by the square root of the number of clowns in an area. We can do this with CartoCSS as follows:

#clowns{
  marker-fill-opacity: 0.9;
  marker-line-color: #FFF;
  marker-line-width: 0;
  marker-line-opacity: 1;
  marker-placement: point;
  marker-type: ellipse;
  marker-width: 5*[sqrtcnt];
  marker-fill: #0F3B82;
  marker-allow-overlap: true;
  marker-comp-op: plus;
  [zoom > 5] {
  	marker-width: 10*[sqrtcnt];
  }
}
Looking through our results, this is a much better map, but there are some empty strings and uniform capitalization. To make it more legible we could apply some string operations.
String operations

Now let's do a little more string cleaning.
For reference, PostgreSQL string operations are listed in the documentation here:
http://www.postgresql.org/docs/9.3/static/functions-string.html.
To handle the conditional nature of empty string versus not, we can use CASE statements (docs on them). These are almost identical to IF THEN statements:
SELECT 
    CASE is_scary
    WHEN true
      THEN clown_name || ' should not go to kids parties'
    ELSE name || ' can go to parties'
    END As go_to_party_or_not
FROM supplemental_clown_table
The || operators add strings together.
If our table looks like this:
|      name       |  is_scary  |
--------------------------------
| SNICKERS        |   true     |
| BISCUIT D CLOWN |   true     |
| FLOOZIE DOO     |   false    |
| PERFLESS        |   true     |
| DAFFY-DILLY     |   true     |

The SQL command above would produce:

|             go_to_party_or_not                |
-------------------------------------------------
| SNICKERS should not go to kids parties        |
| BISCUIT D CLOWN should not go to kids parties |
| FLOOZIE DOO can go to parties                 |
| PERFLESS should not go to kids parties        |
| DAFFY-DILLY should not go to kids parties     |

We can use these CASE statements inside of our string_agg function to process our empty strings ('') into UNKNOWN instead. We can also use the initcap function to only make the initial character of each word capitalized. Notice that we are compositing our operations more and more.
We can also use the keyword DISTINCT to remove duplicates from our zipcode list.
SELECT 
  string_agg(
    initcap(
      CASE clown_name 
      WHEN '' 
      THEN 'UNKNOWN' 
      ELSE clown_name 
      END
    ),
    ', '
  ) clown_names, 
  the_geom_webmercator,
  count(1) cnt,
  sqrt(count(1)) sqrtcnt,
  string_agg(DISTINCT zipcode::text,', ') zips,
  max(cartodb_id) cartodb_id
FROM 
  clowns_sql
GROUP BY
  city, the_geom_webmercator
ORDER BY
  cnt DESC
See working map here.
Challenge #2


For clowns with two names listed, choose only the first name in the list. Check the string functions for help
Calculate the average length of a clown's name for each location.
Show only the top 20 locations for clowns

JOINing datasets

Next, let's find out the number of clowns in each state. To do that we need to get some state polygons.
Let's go back to our dashboard, change to data view, and then hit 'New Dataset'. Search for 'states' and look for the dataset titled 'USA states'. Click on it once to import it into your account.
Now that we have some polygons to work with, we can find out how many points (clowns) intersect with each polygon (state). We can use the PostGIS functions, ST_Contains or ST_Intersects.
SELECT
  count(c.*) As cnt,
  ST_Transform(s.the_geom,3857) As the_geom_webmercator
FROM 
  ne_50m_admin_1_states As s
JOIN
  clowns_sql As c
ON ST_Contains(s.the_geom, c.the_geom)
GROUP BY s.the_geom
This is all well and good, but it may reflect more of a population map than a map showing the significance of a state's clown representation.
If our state dataset had population for each state, we could normalize by population. We don't have that data here, so we could opt instead to normalize by the area of the state, which we can calculate with PostGIS's ST_Area.
We're specifically interested in units of square meters, which we can obtain by casting our geometry to 'geography'... where some magic happens. All calculations are done on a sphere instead of on a plane.
We can get the number of clowns per square meter, but that turns out to be a very small number. Instead, we can get the number of clowns per 1000 square kilometer in a state. Since 1 km = 10^3 m, 1 km^2 = 10^6 m^2. Therefore, 1000 km^2 = 10^9 m^2.
Our query then looks like:
SELECT
  count(c.*) As cnt,
  1e9 * count(c.*) / ST_Area(s.the_geom::geography) As cnt_p_area,
  ST_Transform(s.the_geom,3857) As the_geom_webmercator
FROM 
  ne_50m_admin_1_states As s
JOIN
  clowns_sql As c
ON ST_Contains(s.the_geom, c.the_geom)
GROUP BY s.the_geom
clowns per area
SELECT
  count(c.*) As cnt,
  1e9 * count(c.*) / ST_Area(s.the_geom::geography) As cnt_p_area,
  s.name,
  ST_Transform(s.the_geom,3857) As the_geom_webmercator
FROM 
  ne_50m_admin_1_states As s
JOIN
  eschbacher.clowns_sql As c
ON ST_Contains(s.the_geom, c.the_geom)
GROUP BY s.the_geom, s.name
ORDER BY cnt_p_area DESC
Challenge #3
Find the average distance every clown is from the clown's state center. Hint: use ST_Distance(geom1,geom2) and ST_Centroid(geom)
Resources


CartoDB's free Map Academy: http://academy.cartodb.com/courses/04-sql-postgis.html
Codecademy's SQL course: https://www.codecademy.com/courses/learn-sql
Boundless's PostGIS tutorials: http://workshops.boundlessgeo.com/postgis-intro/