Skip to content

Instantly share code, notes, and snippets.

@pramsey
Last active November 13, 2023 18:33
Show Gist options
  • Save pramsey/85b5791f43cfae33ea59d67ff7c67c6b to your computer and use it in GitHub Desktop.
Save pramsey/85b5791f43cfae33ea59d67ff7c67c6b to your computer and use it in GitHub Desktop.
Spatial SQL and PostGIS

Spatial SQL and PostGIS

This set of instructions goes with this presentation deck.

Installing PostGIS

PGDG Yum Repository

https://yum.postgresql.org/

yum install postgis24_10

Ubuntu GIS

https://wiki.ubuntu.com/UbuntuGIS

sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
sudo apt-get install postgis

GeoCoding

Set up our working database for this talk:

CREATE DATABASE pgopen;
\c pgopen
CREATE EXTENSION postgis;
CREATE LANGUAGE plpythonu;

Create example table:

CREATE TABLE customers (
  name TEXT,
  address TEXT,
  geom GEOMETRY(POINT, 4326)
);

INSERT INTO customers (name, address) VALUES 
  ('Paul', '144 Simcoe Street, Victoria, BC, Canada'),
  ('Hyatt', '655 Burrard St, Vancouver, BC'),
  ('Parc 55', '55 Cyril Magnin St, San Francisco, CA, USA');

Create example geocoding function:

CREATE OR REPLACE FUNCTION geocode(address text)
RETURNS text AS
$$
    from geopy.geocoders import Nominatim
    geolocator = Nominatim(user_agent="plpythonu")
    location = geolocator.geocode(address)
    wkt = 'SRID=4326;POINT(%g %g)' % ((location.longitude, location.latitude))
    return wkt 
$$
LANGUAGE plpythonu VOLATILE
COST 1000;

Try it out:

SELECT ST_AsGeoJson(geocode('144 Simcoe Street, Victoria, BC'));
  
SELECT geocode(address), address 
FROM customers
WHERE name = 'Paul';

UPDATE customers 
SET geom = geocode(address);

Language abstractions over geocoding APIs:

Loading GIS Data

Zip Code Tabulation Areas (Points)

The US Census provides a data set of all 5-digit zip code areas, this file is a point file with a point in the center of each area.

Use the CSV file URL (shortened here using bit.ly) and the PostgreSQL COPY command to load the table:

CREATE TABLE zcta (
  zip TEXT,
  longitude REAL,
  latitude REAL,
  geom GEOMETRY(Point, 4326)
);

-- Read data direct from URL using curl
COPY zcta (zip, latitude, longitude) 
FROM PROGRAM 'curl -L https://bit.ly/2ghU6eZ'
WITH (
  FORMAT csv,
  HEADER true
);

-- Add geometry values to geometry column
UPDATE zcta 
  SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);

-- Index for zip code
CREATE INDEX zcta_zip_idx 
  ON zcta (zip);

-- Cluster to re-write the table after update
CLUSTER zcta USING zcta_zip_idx;

-- Spatial index for geometry
CREATE INDEX zcta_geom_idx 
  ON zcta USING GIST (geom);

Zip Code Tabulation Areas (Polygons)

The US Census provides a shape file of the 5-digit ZCTA:

Unzip the file and load using shp2pgsql:

shp2pgsql -s 4326 \
  -I -D \
  cb_2017_us_zcta510_500k \
  zcta_polys \
  | psql postgresopen

Or load using ogr2ogr from GDAL:

ogr2ogr \
  -f "PostgreSQL" \
  -nlt MULTIPOLYGON \
  -nln zcta_polys \
  PG:"dbname=postgis24" \
  cb_2017_us_zcta510_500k.shp

American Fact Finder

There is also interesting demographic data at different geographic levels. We are loading at a ZCTA level here, but for a "real" analysis with detailed data a finer leval, like a census tract, would be appropriate.

https://factfinder.census.gov/

  • Click "Advanced Search"
  • Click "SHOW ME ALL"
  • Click "Geographies" tab on left
  • Select "5-Digit Zip Code Tabulation Area"
  • Click "Add To Your Selections"
  • Click "CLOSE X" in dialogue
  • Search for "B19001" for household income table
  • Select table and click "Download" button
  • Save ZIP file you receive
  • Unzip contents
  • Run the acs2pgsql.py script to convert files to SQL for loading into PostgreSQL

Or, just download and unzip the pre-fetched data for the one file we are using:

python acs2pgsql.py ACS_16_5YR_B19001_with_ann.csv | psql pgopen

Then create a spatial income table by joining the ACS data to the zipcode polygons:

DROP TABLE IF EXISTS zcta_polys_100k;
CREATE TABLE zcta_polys_100k AS 
SELECT 
  a.geo_id, 
  z.geom, 
  100.0 * (a.hd01_vd14 + a.hd01_vd15 + hd01_vd16 + hd01_vd17) / NULLIF(a.hd01_vd01, 0.0) AS over_100k_pct,
  a.hd01_vd14 + a.hd01_vd15 + hd01_vd16 + hd01_vd17 AS households_over_100k, 
  NULLIF(a.hd01_vd01, 0.0) AS all_households
FROM acs_16_5yr_b19001_with_ann a 
JOIN zcta_polys z ON (z.affgeoid10 = a.geo_id);

CREATE INDEX zcta_polys_100k_geom_x ON zcta_polys_100k 
  USING GIST (geom);

Viewing GIS Data

Viewers

Architectures

Analyzing GIS Data

Starbucks Data

Create a table suitable to hold the data

DROP TABLE IF EXISTS starbucks CASCADE;
CREATE TABLE starbucks (
  Brand TEXT,
  Store_Number integer,
  Name TEXT,
  Ownership_Type TEXT,
  Facility_ID TEXT,
  Features_Products TEXT,
  Features_Service TEXT,
  Features_Stations TEXT,
  Food_Region TEXT,
  Venue_Type TEXT,
  Phone_Number TEXT,
  Location TEXT,
  Street_Address TEXT,
  Street_Line_1 TEXT,
  Street_Line_2 TEXT,
  City TEXT,
  State TEXT,
  Zip TEXT,
  Country TEXT,
  Coordinates TEXT,
  Latitude REAL,
  Longitude REAL,
  Insert_Date TIMESTAMP
  );

Pipe the data directly from opendata.socrata.com to the COPY command:

-- Input data use mm/dd/yyyy format for dates
SET DATESTYLE = US;             

-- Input data are tab delimited
COPY starbucks 
  FROM PROGRAM 'curl -L https://bit.ly/2OQA1K9'
  WITH (
    FORMAT csv,
    DELIMITER E'\t',
    HEADER true
  );
                               
-- Add empty geometry column                                                          
ALTER TABLE starbucks 
  ADD COLUMN geom geometry(point, 4326);

UPDATE starbucks 
  SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);

CREATE INDEX starbucks_geom_idx 
  ON starbucks USING GIST (geom);

Simple Spatial Queries

Find the five nearest Starbucks locations to Pike Place Public Market:

WITH pt AS (
  SELECT ST_SetSRID(ST_MakePoint(-122.3421, 47.6101), 4326) AS pt
)
SELECT name, location, ST_Distance(pt, geom) 
FROM starbucks, pt
ORDER BY geom <-> pt
LIMIT 5

Comparing geometry and geography distance calculations:

WITH pts AS (
  SELECT 'POINT(-122.3421 47.6101)' AS p,
         'POINT(-122.3386 47.6093)' AS q
) 
SELECT ST_Distance(p, q) AS d_geom,
       ST_Distance(p::geography, q::geography) AS d_geog
FROM pts;

Convert the existing tables to US National Atlas Albers projection (EPSG:2163):

ALTER TABLE starbucks
  ALTER COLUMN geom TYPE Geometry(Point, 2163)
  USING ST_Transform(geom, 2163);

ALTER TABLE zcta
  ALTER COLUMN geom TYPE Geometry(Point, 2163)
  USING ST_Transform(geom, 2163);  
  
ALTER TABLE zcta_polys
  ALTER COLUMN geom TYPE Geometry(MultiPolygon, 2163)
  USING ST_Transform(geom, 2163);  

ALTER TABLE zcta_polys_100k
  ALTER COLUMN geom TYPE Geometry(MultiPolygon, 2163)
  USING ST_Transform(geom, 2163);  

REINDEX TABLE starbucks;
REINDEX TABLE zcta;
REINDEX TABLE zcta_polys;
REINDEX TABLE zcta_polys_100k;

Run nearest Starbucks query using the new SRID:

WITH pt AS (
  SELECT ST_Transform(ST_SetSRID(ST_MakePoint(-122.3421, 47.6101), 4326), 2163) AS pt
)
SELECT name, location, ST_Distance(pt, geom) 
FROM starbucks, pt
ORDER BY geom <-> pt
LIMIT 5;

Find all the Starbucks within 1km of Pike Place Market:

WITH pt AS (
  SELECT ST_Transform(ST_SetSRID(
    ST_MakePoint(-122.3421, 47.6101), 
    4326), 2163) AS pt
)
SELECT name, location, 
       ST_Distance(pt, geom) 
FROM starbucks, pt
WHERE ST_DWithin(geom, pt, 1000);

Find all the Starbucks within 1km of Pike Place Market using a self-join instead of a coordinate literal value:

SELECT a.name, a.location, 
       ST_Distance(a.geom, b.geom) 
FROM starbucks a
JOIN starbucks b
ON ST_DWithin(a.geom, b.geom, 1000)
WHERE b.name = 'Pike Place';

What is the average percentage of households over $100K income in the whole USA?

SELECT 100.0 * sum(households_over_100k) / sum(all_households) 
       AS pct_over_100k 
FROM zcta_polys_100k;

What is the average percentage of households over $100K income in zip codes that house a Starbucks?

WITH starbucks_zips AS (
  SELECT DISTINCT ON (geo_id) z.*
  FROM zcta_polys_100k z
  JOIN starbucks s
  ON ST_Intersects(s.geom, z.geom)
)
SELECT 100.0 * sum(households_over_100k) / sum(all_households) 
FROM starbucks_zips

What is the average percentage of houses over $100K income in zip codes that house more than one Starbucks?

WITH starbucks_zips AS (
  SELECT Count(*), 
    Min(households_over_100k) AS households_over_100k,
    Min(all_households) AS all_households 
  FROM zcta_polys_100k z
  JOIN starbucks s
  ON ST_Intersects(s.geom, z.geom)
  GROUP BY geo_id
  HAVING Count(*) > 1
)
SELECT 100.0 * sum(households_over_100k) / sum(all_households) 
FROM starbucks_zips

What Starbucks outlets have "suspect" zip code entries?

SELECT s.Facility_ID, s.name, 
       s.zip, z.zcta5ce10, 
       s.geom
FROM starbucks s
JOIN zcta_polys z 
ON ST_Intersects(z.geom, s.geom)
WHERE z.zcta5ce10 != split_part(s.zip, '-', 1);

Create the 1-digit zip code areas!

SELECT 
  ST_Union(geom) AS geom,
  left(zcta5ce10, 1) AS zip1
FROM zcta_polys
GROUP BY left(zcta5ce10, 1)

Add the 2-digit codes as an overlay?

SELECT 
  ST_Union(geom) AS geom,
  left(zcta5ce10, 2) AS zip1
FROM zcta_polys
GROUP BY left(zcta5ce10, 2)
View raw

(Sorry about that, but we can’t show files that are this big right now.)

This file has been truncated, but you can view the full file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment